InternVL: Open-Source Vision Language Model Family Scaling to 241B Parameters

InternVL from Shanghai AI Lab scales vision transformers to 6B parameters and aligns them with LLMs, achieving GPT-4o-level multimodal performance.

InternVL is a series of open-source vision-language foundation models developed by OpenGVLab at the Shanghai Artificial Intelligence Laboratory. The InternVL family scales vision transformers to 6 billion parameters and progressively aligns them with large language models, creating a unified architecture that achieves GPT-4o-level performance across a wide range of multimodal benchmarks. The flagship InternVL2.5-241B model represents one of the largest open-source multimodal models ever released.

The project has been recognized at CVPR 2024 and has garnered significant attention for demonstrating that open-source vision-language models can match or exceed proprietary systems when scaled appropriately. InternVL’s architecture handles tasks spanning image captioning, visual question answering, document understanding, chart analysis, and multi-image reasoning, making it a versatile foundation for multimodal AI applications.

How does InternVL’s architecture work?

InternVL uses a progressive alignment strategy. The vision encoder (InternViT) is pre-trained at scale, up to 6B parameters, and then connected to an LLM through an MLP projector and aligned over successive training stages. On the input side, InternVL avoids the fixed low-resolution downsampling of earlier VLMs: images are processed at their native aspect ratio by dynamically dividing them into tiles, each encoded at high resolution and then merged for global understanding.
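
To make the alignment step concrete, here is a minimal sketch of the projector stage: a ViT encodes image tiles, and an MLP maps the visual features into the LLM's token embedding space. The dimensions (3200 for the vision features, 4096 for the LLM) and the two-layer design are illustrative assumptions, not the released model's exact configuration.

```python
# Illustrative sketch of the vision-to-LLM bridge in an InternVL-style
# pipeline. Dimensions and module choices are assumptions for clarity.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim=3200, llm_dim=4096):
        super().__init__()
        # Two-layer MLP that lifts vision features into the language
        # model's embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):
        # vision_features: (batch, num_patches, vision_dim)
        return self.proj(vision_features)  # (batch, num_patches, llm_dim)

# Hypothetical ViT output for one 448x448 tile.
features = torch.randn(1, 1024, 3200)
visual_tokens = VisionToLLMProjector()(features)
print(visual_tokens.shape)  # torch.Size([1, 1024, 4096])
```

The projected visual tokens are then interleaved with text tokens, so the LLM attends over both modalities in a single sequence.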

What model sizes are available?

Model            | Vision Encoder | LLM Backbone | Total Parameters | Context Window
InternVL2-1B     | 300M           | 0.5B         | 1B               | 128K
InternVL2-8B     | 300M           | 7B           | 8B               | 128K
InternVL2-26B    | 300M           | 25B          | 26B              | 128K
InternVL2-76B    | 6B             | 70B          | 76B              | 128K
InternVL2.5-241B | 6B             | 235B         | 241B             | 256K
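
Any of these checkpoints can be pulled from Hugging Face with the standard transformers API. The sketch below loads one of the smaller variants; the checkpoint name and loading pattern follow the OpenGVLab model cards, but check the card of the specific model for its current chat interface before use.

```python
# Minimal loading sketch using the Hugging Face transformers API.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-8B"  # smaller variant; larger ones load the same way
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # InternVL ships custom modeling code with the weights
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```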

Benchmark Performance

InternVL2.5-241B achieves competitive or state-of-the-art results across major multimodal benchmarks, often matching or exceeding GPT-4o and Gemini Ultra on vision-language tasks.

Benchmark  | InternVL2.5-241B | GPT-4o | Gemini Ultra 1.5 | InternVL2-76B
MMMU (val) | 72.1%            | 69.1%  | 62.2%            | 65.4%
MathVista  | 66.8%            | 63.8%  | 61.3%            | 60.2%
ChartQA    | 85.3%            | 81.6%  | 79.8%            | 80.1%
DocVQA     | 92.7%            | 90.2%  | 88.9%            | 88.5%
OCRBench   | 851              | 828    | 810              | 812

What is dynamic high-resolution processing?

Traditional VLMs resize all input images to a fixed resolution, losing critical detail for tasks like document understanding or chart reading. InternVL’s dynamic tiling approach preserves the original aspect ratio by dividing images into tiles of 448x448 pixels. Each tile is processed independently by the vision encoder at full resolution, and the resulting features are merged with global context to maintain both detail and holistic understanding. This is especially valuable for dense text documents, scientific figures, and UI screenshots where fine details matter.
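
The following sketch shows the core idea: pick a tile grid whose aspect ratio best matches the image, resize, and cut into 448x448 crops. The grid-selection rule and tile budget here are simplified assumptions; the released preprocessing code may differ in details such as thumbnail handling.

```python
# Sketch of dynamic high-resolution tiling, assuming 448x448 tiles.
from PIL import Image

TILE = 448

def dynamic_tiles(img: Image.Image, max_tiles: int = 12):
    w, h = img.size
    best, best_diff = (1, 1), float("inf")
    # Choose the (cols, rows) grid whose aspect ratio best matches
    # the image, subject to a total tile budget.
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs((cols / rows) - (w / h))
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    cols, rows = best
    # Resize to an exact multiple of the tile size, then crop tiles.
    resized = img.resize((cols * TILE, rows * TILE))
    return [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]
```

Each returned crop is encoded independently by InternViT, which is why fine print in documents and small chart labels survive that would be destroyed by a single global resize.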

What is the license for InternVL?

InternVL is released under the MIT License or Apache 2.0 depending on the specific model version. The model weights are freely available on Hugging Face, and the training code, inference scripts, and evaluation benchmarks are all open-source. This permissive licensing has enabled widespread adoption in both academic research and commercial applications, including use in document processing pipelines, accessibility tools, and multimodal search systems.

Can InternVL handle video input?

While InternVL is primarily designed for image understanding, the architecture naturally extends to video by processing frames as a sequence of images. The model can reason across multiple frames using its extended context window, supporting tasks like video captioning, activity recognition, and temporal reasoning. The 256K token context window in InternVL2.5-241B allows processing dozens of high-resolution frames in a single forward pass.
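
In practice this means decoding a video into a handful of frames and feeding each through the same tiling and encoding path as a still image. A minimal frame-sampling sketch using OpenCV follows; the uniform sampling strategy and frame count are assumptions, not a prescribed recipe.

```python
# Sketch of uniform frame sampling for video input.
import cv2

def sample_frames(video_path: str, num_frames: int = 16):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes BGR; convert to RGB for the vision encoder.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```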

How does InternVL compare to other open-source VLMs?

InternVL consistently outperforms other open-source VLMs like LLaVA, Qwen-VL, and CogVLM on standard benchmarks, particularly on tasks requiring high-resolution understanding such as OCR and document parsing. The 241B variant brings open-source VLM performance into direct competition with proprietary systems for the first time. The intermediate model sizes (8B, 26B) offer practical trade-offs for deployment scenarios where computational budget is limited.

Frequently Asked Questions

What is InternVL? InternVL is an open-source vision-language model family developed by Shanghai AI Lab that scales vision transformers to 6B parameters, achieving GPT-4o-level performance.

What model versions are available? Sizes range from 1B to 241B parameters, with InternVL2.5-241B being the flagship model offering 256K context and state-of-the-art multimodal performance.

What is the architecture? InternVL uses a progressive alignment strategy with a large-scale InternViT vision encoder, an MLP projector, and a standard LLM backbone with dynamic high-resolution tiling.

How does it perform on benchmarks? InternVL2.5-241B achieves competitive results on MMMU (72.1%), MathVista (66.8%), ChartQA (85.3%), and DocVQA (92.7%), often matching or exceeding GPT-4o.

What license is used? InternVL is released under the MIT License or Apache 2.0, with model weights freely available on Hugging Face for both research and commercial use.
