Survey

Toward Native Multimodal Modeling:
A Roadmap

Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. We formalize architectural nativity, distinguish the native mid-fusion and early-fusion paradigms from non-native ones, and survey the end-to-end NMM pipeline along the input–output duality axis (M2T / M2G / M2M) across architecture, data, training, inference, and evaluation.

TL;DR

Contributions

i

Problem Formalization

We first present a formal, systematic definition of NMM and establish a principled structural taxonomy based on integration depth (mid-/early-fusion) and input–output duality (Multi-to-{Text, Target, Multi}) to clarify the fragmented design space.

ii

Technological Roadmap

We systematically analyze the full lifecycle of NMM, extracting the core modal bottlenecks and cross-cutting solutions across architectural designs, data curricula, training strategies, inference deployment, and holistic evaluation.

iii

Future Outlook

We distill empirical insights from state-of-the-art implementations and paradigms, project likely future trajectories, and suggest strategic directions for the evolution toward advanced NMM.

Taxonomy

A two-axis map of NMM

Axes: fusion regime (mid-fusion, early-fusion) × input–output duality (M2T = Multi-to-Text, M2G = Multi-to-Target, M2M = Multi-to-Multi). Each cell groups representative models for that combination.
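
One way to make the two axes concrete is to treat each model as a point in a small grid. The sketch below is a minimal Python illustration, not the survey's formal definition: the `classify_duality` rule (text-only output is M2T, an output set equal to the input set is M2M, anything else is M2G) is an assumed reading of the three labels, and the `ModelCard` class and the `demo-omni` entry are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Fusion(Enum):
    MID = "mid-fusion"
    EARLY = "early-fusion"


class Duality(Enum):
    M2T = "Multi-to-Text"
    M2G = "Multi-to-Target"
    M2M = "Multi-to-Multi"


@dataclass(frozen=True)
class ModelCard:
    """Hypothetical record placing one model on the two-axis map."""
    name: str
    fusion: Fusion
    inputs: frozenset   # subset of {"T", "I", "A", "V"}
    outputs: frozenset  # subset of {"T", "I", "A", "V"}


def classify_duality(inputs: frozenset, outputs: frozenset) -> Duality:
    """Assumed rule of thumb; the survey's own criteria may differ."""
    if outputs == frozenset({"T"}):
        return Duality.M2T  # several modalities in, text only out
    if len(outputs) > 1 and outputs == inputs:
        return Duality.M2M  # symmetric: same modalities in and out
    return Duality.M2G      # output restricted to specific target modalities


def cell(card: ModelCard) -> tuple:
    """Return the (fusion, duality) cell a model falls into."""
    return card.fusion, classify_duality(card.inputs, card.outputs)


# Hypothetical entry, for illustration only.
demo = ModelCard("demo-omni", Fusion.EARLY, frozenset("TIAV"), frozenset("TA"))
print(cell(demo))
```

Keeping fusion regime and duality as separate fields mirrors the claim that the two axes are orthogonal: either one can change without touching the other.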

Paper Outline

Five pillars of the NMM roadmap

1

Model Architecture

NMM systems assign distinct functional roles to comprehend and generate different modalities. The functional categories are defined by their input–output configurations (M2T / M2G / M2M), orthogonal to the architectural taxonomy of mid-fusion vs. early-fusion; each category contains representatives of both fusion paradigms.

  • M2T — Unimodal Generation. Image/Audio/Video comprehension via modality unification, multi-image reasoning, multi-scale encoding, and perception-reasoning decoupling (Llama-4, Kimi K2.5, Qwen3-VL, GLM-5V-Turbo).
  • M2G — Scenario-based Generation. High-fidelity image generation, low-latency audio synthesis with streaming synergy, and video generation with audio–visual alignment via unified timelines (MiniCPM-o 4.5, Qwen3-Omni).
  • M2M — Symmetric Modeling. Two competing routes: fully-discretized unified (loss from discretizing, competition-driven latency) vs. modality-specificity preserving (comprehension–generation dilemma, bridging AR and Diffusion).

Figure 4: A hierarchical taxonomy of the major technical challenges, core design axes, and representative NMM systems, derived from the discussion in Section 3.

2

Training Techniques

Each fusion regime imposes a distinct training signature across five dimensions: freezing topology, learning-rate topology, loss formulation, stability prescription, and curriculum scheduling. SFT and RL inherit this signature and add two further regime-specific axes: an SFT-time rewiring of the freezing topology that only mid-fusion can perform, and an RL-time policy scope dictated by the fusion regime rather than the RL algorithm.

  • PT. Mid-fusion: progressive unfreezing + differential LR. Early-fusion: joint-from-start with z-loss & QK-Norm as preconditions.
  • SFT. Only mid-fusion can rewire freezing; early-fusion can only rebalance the modality mixture.
  • RL. OPD / MOPD emerges as the terminal consolidation step against grounding-hack & see-saw failures.
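
The two early-fusion stability preconditions named in the PT item can be sketched in a few lines. This is a minimal pure-Python illustration, assuming the common formulation of z-loss as a penalty on the squared log-partition of the output softmax, and of QK-Norm as normalizing queries and keys before the attention dot product. The coefficient `1e-4`, the use of RMS normalization without learnable gains, and all function names are illustrative assumptions rather than settings taken from any surveyed model.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log of the softmax partition function."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    """Cross-entropy plus z-loss for one token.

    The z-loss term z_coeff * log(Z)^2 discourages the log-partition
    from drifting far from zero. z_coeff is an assumed value.
    """
    log_z = log_sum_exp(logits)
    ce = log_z - logits[target]
    z_loss = z_coeff * log_z ** 2
    return ce + z_loss, ce, z_loss


def rms_norm(vec, eps=1e-6):
    """RMS-normalize a vector (learnable gain omitted for brevity)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def qk_norm_score(q, k):
    """Attention logit with QK-Norm: normalize q and k, then scaled dot product."""
    qn, kn = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))


total, ce, z = cross_entropy_with_z_loss([2.0, 0.5, -1.0], target=0)
print(f"ce={ce:.4f} z_loss={z:.6f} total={total:.4f}")

# Without normalization this logit would be 1000 * 1000 / sqrt(4) = 500000.
print(qk_norm_score([1000.0, 0.0, 0.0, 0.0], [1000.0, 0.0, 0.0, 0.0]))
```

Both terms bound quantities that can otherwise grow without limit, the output log-partition and the attention logits, which is a plausible reason for an early-fusion recipe to treat them as preconditions for joint training rather than optional extras.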
Figure: Training pipeline across fusion regimes.

3

Datasets

Native multimodal data spans heterogeneous mixtures of text, image, video, audio, documents, GUI states, tool-use traces, and preference signals. We organize the datasets by functional role (Understand, Generate, Interact, Align), layered across PT → SFT → RL.

| Category | Sub-type | Representative Datasets | Modalities | Key Supervision / Description |
| --- | --- | --- | --- | --- |
| Understand | Image-Text Alignment | LAION-5B, COCO Captions, CC3M/CC12M, YFCC100M, DataComp | T, I | Weakly-aligned web-scale pairs for cross-modal mapping; image-level semantics. |
| Understand | VQA & Instruction Tuning | VQA v2, GQA, OK-VQA, ScienceQA, LLaVA-Instruct, InstructBLIP | T, I | Task-driven QA and multi-turn instruction-following; perception → dialogue. |
| Understand | Interleaved & Multi-Image | MMC4, OBELICS, OmniCorpus, MANTIS, NLVR2, MuirBench, BLINK | T, I | Interleaved documents and multi-image reasoning; long-range dependency. |
| Understand | Document, Chart & Grounding | DocVQA, InfographicVQA, ChartQA, TextVQA, Flickr30k Entities, RefCOCO | T, I | Text, layouts, charts; region-level boxes and referring expressions. |
| Understand | Video & Audio Understanding | MSR-VTT, ActivityNet, WebVid, AudioSet, LibriSpeech, Common Voice, Clotho | T, I, V, A | Temporal action/event recognition, ASR, acoustic scene understanding. |
| Generate | Text-to-Image & Editing | LAION-5B, DiffusionDB, InstructPix2Pix, MagicBrush, HQ-Edit, UltraEdit | T, I | Synthesis or editing from prompts; controlled transformation with masks. |
| Generate | Controllable Generation | ControlNet, GLIGEN, T2I-Adapter, Composer | T, I + cond. | Grounded generation with depth, sketch, layout, bounding boxes, pose. |
| Generate | Interleaved Image-Text Gen. | VIST, OpenLEAF, CoMM, InterSyn | T, I | Coherent multimodal sequences (stories, tutorials) with entity consistency. |
| Generate | Video Generation | WebVid-10M, Panda-70M, OpenVid-1M, VidGen-1M | T, I, V | T2V/I2V; temporal coherence, motion quality, caption recaptioning. |
| Generate | Audio & Speech Generation | LibriTTS, VCTK, GigaSpeech, Emilia, AudioCaps, WavCaps, MusicCaps | T, A (Sp) | TTS, voice cloning, music and environmental sound generation. |
| Interact | Web Interaction | WebShop, Mind2Web, WebArena, VisualWebArena, WebLINX, WebVoyager | T, I (GUI) | Goal-driven web navigation: search, click, form fill on real/sim sites. |
| Interact | Mobile & Desktop GUI | AITW, RICO, ScreenAI, SeeClick, OSWorld, Windows Agent Arena | T, I (GUI) | Screenshot/UI-tree to action; mobile and OS environments. |
| Interact | Embodied Interaction | ALFWorld, BridgeData V2, Open X-Embodiment, Magma | T, I, V (robot) | Language-conditioned manipulation from visual + robot states. |
| Align | Hallucination & Faithfulness | LLaVA-RLHF, RLHF-V, VLFeedback, RLAIF-V, HA-DPO, V-DPO | T, I | Comparative / span-level feedback to reduce hallucinations. |
| Align | Safety Alignment | SPA-VL, Safe RLHF-V | T, I | Safe/unsafe response pairs under multimodal harmful prompts. |
| Align | Generation Quality Pref. | ImageReward, Pick-a-Pic, HPS v2, VBench, VBench++ | I, V | Aesthetics, alignment, temporal consistency, motion quality. |
| Align | Agentic Preference | Environment Rewards, Human Demos | T, I, A | Env-aligned action correctness, efficiency, recovery. |

Modalities: T = Text, I = Image, V = Video, A = Audio/Speech.
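
The table can also be read as a small registry keyed by functional role and modality. The sketch below encodes a handful of its rows and filters them. The `DatasetEntry` class and `select` helper are hypothetical names, only a subset of rows is included, and each dataset simply inherits the modality set of its table row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical registry record mirroring one dataset in the table."""
    role: str              # Understand / Generate / Interact / Align
    sub_type: str
    name: str
    modalities: frozenset  # row-level modalities, subset of {"T", "I", "V", "A"}


# A few rows copied from the table (not exhaustive).
REGISTRY = [
    DatasetEntry("Understand", "Image-Text Alignment", "LAION-5B", frozenset("TI")),
    DatasetEntry("Understand", "Document, Chart & Grounding", "DocVQA", frozenset("TI")),
    DatasetEntry("Understand", "Video & Audio Understanding", "MSR-VTT", frozenset("TIVA")),
    DatasetEntry("Generate", "Video Generation", "WebVid-10M", frozenset("TIV")),
    DatasetEntry("Generate", "Audio & Speech Generation", "LibriTTS", frozenset("TA")),
    DatasetEntry("Interact", "Web Interaction", "Mind2Web", frozenset("TI")),
    DatasetEntry("Align", "Generation Quality Pref.", "Pick-a-Pic", frozenset("IV")),
]


def select(registry, role=None, needs=frozenset()):
    """Return names of entries matching a role and covering all needed modalities."""
    return [
        entry.name
        for entry in registry
        if (role is None or entry.role == role) and needs <= entry.modalities
    ]


print(select(REGISTRY, role="Generate", needs=frozenset("V")))
print(select(REGISTRY, needs=frozenset("A")))
```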

4

Inference & Deployment

Native pretraining amplifies the long-context problem: a single high-resolution image, document, or long video balloons into thousands or millions of tokens. Efficient serving therefore attacks sequence explosion, the heterogeneity-scale tension, and full-duplex streaming simultaneously.

  • Visual resampling, dynamic resolution, and spatially sparse perception (VisionZip, SparseVLM, FitPrune).
  • Pure discrete tokenization vs. MoE / hybrid routing trade-offs.
  • Full-duplex state management, adaptive-bitrate control, modality-aware mixed quantization.
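
A back-of-the-envelope calculation shows how quickly visual inputs reach the token counts described above, and why sparsification is attractive. The sketch assumes a ViT-style patch grid with a 14-pixel patch and no token merging, and a video sampled at 1 frame per second; these numbers are illustrative assumptions. The `keep_top_k` helper is a generic score-based pruning step, not the algorithm of VisionZip, SparseVLM, or FitPrune.

```python
import math


def image_tokens(height, width, patch=14):
    """Patch tokens for one image, assuming a plain patch grid with no merging."""
    return math.ceil(height / patch) * math.ceil(width / patch)


def video_tokens(seconds, fps, height, width, patch=14):
    """Patch tokens for a video sampled at a fixed frame rate."""
    frames = int(seconds * fps)
    return frames * image_tokens(height, width, patch)


def keep_top_k(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order."""
    if k >= len(tokens):
        return list(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept]


print(image_tokens(1344, 1344))         # 96 * 96 = 9216
print(video_tokens(3600, 1, 448, 448))  # 3600 * 32 * 32 = 3686400
print(keep_top_k(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], 2))  # ['b', 'd']
```

Under these assumptions a single high-resolution image already costs thousands of tokens and an hour of video costs millions, so even a moderate keep ratio changes the serving budget substantially.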
5

Evaluation

Native architectures demand evaluation that simultaneously verifies understanding (perception, reasoning, grounding) and generation (synthesis, editing, controllability) without one-sided regression. We organize benchmarks by modality (Image / Audio / Video) and, within each modality, separate understanding from generation.

| Modality | Task Group | Benchmark | Metric | Key Characteristics |
| --- | --- | --- | --- | --- |
| Image | General Perception | VQAv2 | Acc. | Open-ended VQA with balanced answer distribution. |
| Image | General Perception | GQA | Acc. | Compositional questions grounded on scene graphs. |
| Image | General Perception | SEED-Bench | Acc. | 12 evaluation dims across spatial & temporal reasoning. |
| Image | General Perception | MMBench | Acc. | Bilingual multi-choice with circular evaluation. |
| Image | General Perception | MMStar | Acc. | Vision-indispensable, leakage-controlled selection. |
| Image | Knowledge Reasoning | MMMU | Acc. | College-level reasoning over 30 subjects spanning 6 disciplines. |
| Image | Knowledge Reasoning | MathVista | Acc. | Mathematical reasoning grounded in visual contexts. |
| Image | Hallucination | POPE | F1 | Polling-based binary probing of object hallucination. |
| Image | Hallucination | RLHF-V | Hall. Score | Segment-level fine-grained hallucination evaluation. |
| Image | Document & OCR | DocVQA | ANLS | Question answering on document images. |
| Image | Document & OCR | ChartQA | Acc. | Visual and logical reasoning over charts and plots. |
| Image | Document & OCR | InfoVQA | ANLS | Multi-hop reasoning over infographic layouts. |
| Image | Document & OCR | OCRBench | Acc. | Comprehensive OCR perception across 29 sub-tasks. |
| Image | Generation | GenEval / DPG-Bench / T2I-CompBench / FID | Comp. / Align. / Multi / Distrib. | Attribute binding, dense prompts, compositional T2I, distributional FID. |
| Audio | Speech Recognition | LibriSpeech | WER | Read English speech, clean and other splits. |
| Audio | Speech Recognition | CommonVoice | WER | Crowdsourced multilingual ASR across diverse accents. |
| Audio | Speech Recognition | FLEURS | WER | Few-shot ASR across 102 languages. |
| Audio | Speech Synthesis | MOS-Bench | MOS | Subjective rating of naturalness and prosody. |
| Audio | Full-Duplex Interaction | Moshi Eval | Latency | Real-time full-duplex with 200 ms target latency. |
| Audio | Full-Duplex Interaction | Full-Duplex-Bench / SoulX-Duplug-Eval | Multi / Lat. | Turn-taking, barge-in handling, false-interruption rate. |
| Video | Offline Understanding | VideoMME | Acc. | General video QA spanning short to long durations. |
| Video | Offline Understanding | EgoSchema | Acc. | Long-form egocentric video QA with temporal reasoning. |
| Video | Offline Understanding | MVBench | Acc. | 20 fine-grained temporal tasks. |
| Video | Offline Understanding | PerceptionTest | Acc. | Multimodal perception and causal-reasoning skill probe. |
| Video | Offline Understanding | LongVideoBench | Acc. | Hour-long referring and reasoning over long contexts. |
| Video | Offline Understanding | MLVU | Acc. | Multi-task long-video understanding. |
| Video | Streaming Understanding | OVO-Bench | Multi | Online perception with backward tracing of past events. |
| Video | Streaming Understanding | StreamingBench | Acc./Lat. | Video comprehension under latency constraints. |
| Video | Streaming Understanding | OmniMMI | Multi | Multimodal streaming interaction evaluation. |
| Video | Generation | UCF-101 / Kinetics-600 | FVD | Action-class video generation distributional metric. |
| Video | Generation | VBench / SeedVideoBench 2.0 / Arena.AI | Multi / 6-dim / Elo | Temporal consistency, A/V sync, community Elo ranking. |
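
Two of the metrics in the table, WER and ANLS, reduce to edit distance and can be sketched compactly. The code below is a minimal reference-style illustration under common definitions: WER as word-level edit distance divided by reference length, and ANLS as one minus the normalized character-level edit distance, zeroed at or above a 0.5 threshold and taken as the best match over the gold answers. Text normalization here (whitespace splitting, lowercasing) is a simplifying assumption, and official benchmark scripts may differ in detail.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h from the hypothesis
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def anls(predictions, gold_answers, threshold=0.5):
    """Average Normalized Levenshtein Similarity over a set of questions."""
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        p = pred.strip().lower()
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = edit_distance(g, p) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)


print(wer("the cat sat on the mat", "the cat sit on mat"))
print(anls(["paris"], [["Paris", "Paris, France"]]))
```

In the first call the hypothesis has one substituted and one dropped word against a six-word reference, so the score is 2/6. In the second, the prediction matches one gold answer after lowercasing, so the ANLS is 1.0.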

Cite

BibTeX

paper.bib
@inproceedings{nmm2026roadmap,
  title     = {Toward Native Multimodal Modeling: A Roadmap},
  author    = {Siyu An and Junru Lu and Junnan Dong and Qiufeng Wang and Yinghui Li and Weizhi Fei and Zichao Yu and Zheng Yuan and Biao Liu and Haopeng Wang and Renzhao Liang and Yixuan Yang and Yunhang Shen and Bo Ke and Keyu Chen and Linhao Luo and Difan Zou and Xiao Huang and Di Yin and Ruizhi Qiao and Xing Sun},
  year      = {2026},
}