InternVL: Open-Source Vision Language Model Family Scaling to 241B Parameters

InternVL from Shanghai AI Lab scales vision transformers to 6B parameters and aligns them with LLMs, achieving GPT-4o-level multimodal performance.

InternVL is a series of open-source vision-language foundation models developed by OpenGVLab at the Shanghai Artificial Intelligence Laboratory. The InternVL family scales vision transformers to 6 billion parameters and progressively aligns them with large language models, creating a unified architecture that achieves GPT-4o-level performance across a wide range of multimodal benchmarks. The flagship InternVL2.5-241B model represents one of the largest open-source multimodal models ever released.

The project has been recognized at CVPR 2024 and has garnered significant attention for demonstrating that open-source vision-language models can match or exceed proprietary systems when scaled appropriately. InternVL’s architecture handles tasks spanning image captioning, visual question answering, document understanding, chart analysis, and multi-image reasoning, making it a versatile foundation for multimodal AI applications.

How does InternVL’s architecture work?

InternVL uses a progressive alignment strategy. The vision encoder (InternViT) is pre-trained at scale, up to 6B parameters, and then connected to an LLM through an MLP projector and aligned over successive training stages. On the input side, InternVL avoids the fixed low-resolution downsampling of earlier VLMs: images are processed at their native aspect ratio by dynamically dividing them into tiles, each encoded at high resolution and then merged for global understanding.
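
To make the alignment step concrete, here is a minimal sketch of the projector stage: a ViT encodes image tiles, and an MLP maps the visual features into the LLM's token embedding space. The dimensions (3200 for the vision features, 4096 for the LLM) and the two-layer design are illustrative assumptions, not the released model's exact configuration.

```python
# Illustrative sketch of the vision-to-LLM bridge in an InternVL-style
# pipeline. Dimensions and module choices are assumptions for clarity.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim=3200, llm_dim=4096):
        super().__init__()
        # Two-layer MLP that lifts vision features into the language
        # model's embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):
        # vision_features: (batch, num_patches, vision_dim)
        return self.proj(vision_features)  # (batch, num_patches, llm_dim)

# Hypothetical ViT output for one 448x448 tile.
features = torch.randn(1, 1024, 3200)
visual_tokens = VisionToLLMProjector()(features)
print(visual_tokens.shape)  # torch.Size([1, 1024, 4096])
```

The projected visual tokens are then interleaved with text tokens, so the LLM attends over both modalities in a single sequence.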

What model sizes are available?

Model            | Vision Encoder | LLM Backbone | Total Parameters | Context Window
InternVL2-1B     | 300M           | 0.5B         | 1B               | 128K
InternVL2-8B     | 300M           | 7B           | 8B               | 128K
InternVL2-26B    | 300M           | 25B          | 26B              | 128K
InternVL2-76B    | 6B             | 70B          | 76B              | 128K
InternVL2.5-241B | 6B             | 235B         | 241B             | 256K
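
Any of these checkpoints can be pulled from Hugging Face with the standard transformers API. The sketch below loads one of the smaller variants; the checkpoint name and loading pattern follow the OpenGVLab model cards, but check the card of the specific model for its current chat interface before use.

```python
# Minimal loading sketch using the Hugging Face transformers API.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-8B"  # smaller variant; larger ones load the same way
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # InternVL ships custom modeling code with the weights
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```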

Benchmark Performance

InternVL2.5-241B achieves competitive or state-of-the-art results across major multimodal benchmarks, often matching or exceeding GPT-4o and Gemini Ultra on vision-language tasks.

Benchmark  | InternVL2.5-241B | GPT-4o | Gemini Ultra 1.5 | InternVL2-76B
MMMU (val) | 72.1%            | 69.1%  | 62.2%            | 65.4%
MathVista  | 66.8%            | 63.8%  | 61.3%            | 60.2%
ChartQA    | 85.3%            | 81.6%  | 79.8%            | 80.1%
DocVQA     | 92.7%            | 90.2%  | 88.9%            | 88.5%
OCRBench   | 851              | 828    | 810              | 812

What is dynamic high-resolution processing?

Traditional VLMs resize all input images to a fixed resolution, losing critical detail for tasks like document understanding or chart reading. InternVL’s dynamic tiling approach preserves the original aspect ratio by dividing images into tiles of 448x448 pixels. Each tile is processed independently by the vision encoder at full resolution, and the resulting features are merged with global context to maintain both detail and holistic understanding. This is especially valuable for dense text documents, scientific figures, and UI screenshots where fine details matter.
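
The following sketch shows the core idea: pick a tile grid whose aspect ratio best matches the image, resize, and cut into 448x448 crops. The grid-selection rule and tile budget here are simplified assumptions; the released preprocessing code may differ in details such as thumbnail handling.

```python
# Sketch of dynamic high-resolution tiling, assuming 448x448 tiles.
from PIL import Image

TILE = 448

def dynamic_tiles(img: Image.Image, max_tiles: int = 12):
    w, h = img.size
    best, best_diff = (1, 1), float("inf")
    # Choose the (cols, rows) grid whose aspect ratio best matches
    # the image, subject to a total tile budget.
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs((cols / rows) - (w / h))
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    cols, rows = best
    # Resize to an exact multiple of the tile size, then crop tiles.
    resized = img.resize((cols * TILE, rows * TILE))
    return [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]
```

Each returned crop is encoded independently by InternViT, which is why fine print in documents and small chart labels survive that would be destroyed by a single global resize.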

What is the license for InternVL?

InternVL is released under the MIT License or Apache 2.0 depending on the specific model version. The model weights are freely available on Hugging Face, and the training code, inference scripts, and evaluation benchmarks are all open-source. This permissive licensing has enabled widespread adoption in both academic research and commercial applications, including use in document processing pipelines, accessibility tools, and multimodal search systems.

Can InternVL handle video input?

While InternVL is primarily designed for image understanding, the architecture naturally extends to video by processing frames as a sequence of images. The model can reason across multiple frames using its extended context window, supporting tasks like video captioning, activity recognition, and temporal reasoning. The 256K token context window in InternVL2.5-241B allows processing dozens of high-resolution frames in a single forward pass.
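
In practice this means decoding a video into a handful of frames and feeding each through the same tiling and encoding path as a still image. A minimal frame-sampling sketch using OpenCV follows; the uniform sampling strategy and frame count are assumptions, not a prescribed recipe.

```python
# Sketch of uniform frame sampling for video input.
import cv2

def sample_frames(video_path: str, num_frames: int = 16):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes BGR; convert to RGB for the vision encoder.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```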

How does InternVL compare to other open-source VLMs?

InternVL consistently outperforms other open-source VLMs like LLaVA, Qwen-VL, and CogVLM on standard benchmarks, particularly on tasks requiring high-resolution understanding such as OCR and document parsing. The 241B variant brings open-source VLM performance into direct competition with proprietary systems for the first time. The intermediate model sizes (8B, 26B) offer practical trade-offs for deployment scenarios where computational budget is limited.

Frequently Asked Questions

What is InternVL? InternVL is an open-source vision-language model family developed by Shanghai AI Lab that scales vision transformers to 6B parameters, achieving GPT-4o-level performance.

What model versions are available? Sizes range from 1B to 241B parameters, with InternVL2.5-241B being the flagship model offering 256K context and state-of-the-art multimodal performance.

What is the architecture? InternVL uses a progressive alignment strategy with a large-scale InternViT vision encoder, an MLP projector, and a standard LLM backbone with dynamic high-resolution tiling.

How does it perform on benchmarks? InternVL2.5-241B achieves competitive results on MMMU (72.1%), MathVista (66.8%), ChartQA (85.3%), and DocVQA (92.7%), often matching or exceeding GPT-4o.

What license is used? InternVL is released under the MIT License or Apache 2.0, with model weights freely available on Hugging Face for both research and commercial use.
