Multimodal AI models that can simultaneously process vision, speech, and text represent the cutting edge of artificial intelligence. OpenAI’s GPT-4o demonstrated the potential of this approach, but its closed nature has left the open-source community racing to catch up. MiniCPM-o, developed by OpenBMB (an offshoot of Tsinghua University’s NLP lab), has achieved a remarkable milestone: it outperforms GPT-4o on single-image understanding benchmarks while matching or exceeding it on speech tasks – all in an open-source package.
The project, hosted at github.com/OpenBMB/MiniCPM-o, is a series of multimodal LLMs that extend the MiniCPM family’s impressive performance-to-size ratio into the multimodal domain. MiniCPM-o supports full-duplex voice interaction – meaning it can listen and speak simultaneously, like a natural conversation – along with image understanding, optical character recognition, and multi-turn dialogue.
What makes MiniCPM-o particularly remarkable is the efficiency of its architecture. While GPT-4o likely requires enormous computational resources, MiniCPM-o achieves competitive or superior results on key benchmarks with a model that can run on consumer hardware. This democratization of multimodal AI capabilities has made it one of the most important open-source AI releases of recent years.
What is MiniCPM-o?
MiniCPM-o is a series of open-source multimodal LLMs that process vision, speech, and text simultaneously. Developed by OpenBMB, it builds on the MiniCPM language model family and extends it with visual and speech understanding capabilities. It supports full-duplex voice interaction, single and multi-image understanding, and achieves state-of-the-art results on several key benchmarks.
What model versions are available?
MiniCPM-o comes in several variants optimized for different use cases.
| Model | Parameters | Modalities | Key Strength |
|---|---|---|---|
| MiniCPM-o 2.6 | 8B | Vision + Text | Best image understanding in class |
| MiniCPM-o 2.6 (Speech) | 8B | Vision + Speech + Text | Full-duplex voice interaction |
| MiniCPM-V 2.6 | 8B | Vision + Text | Pure VLM, lower resource usage |
| MiniCPM-Llama3-V 2.5 | 9B | Vision + Text | LLaMA-based, broader ecosystem |
The 2.6 release is the current flagship, introducing speech capabilities that were absent in earlier versions.
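For reference, the variants above map to repositories on the Hugging Face hub. The repo IDs below follow OpenBMB's published naming convention; the assumption that the speech-enabled 2.6 variant is served from the same MiniCPM-o-2_6 repository should be verified on the hub:

```python
# Hugging Face repo IDs for each variant (verify at huggingface.co/openbmb;
# the speech-enabled 2.6 variant is assumed to share the MiniCPM-o-2_6 repo)
MODEL_REPOS = {
    "MiniCPM-o 2.6": "openbmb/MiniCPM-o-2_6",
    "MiniCPM-V 2.6": "openbmb/MiniCPM-V-2_6",
    "MiniCPM-Llama3-V 2.5": "openbmb/MiniCPM-Llama3-V-2_5",
}
```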
What full-duplex capabilities does MiniCPM-o offer?
Full-duplex voice interaction is MiniCPM-o’s standout feature – it can listen and speak simultaneously, like a human conversation.
| Capability | Description | Latency |
|---|---|---|
| Real-time ASR | Automatic speech recognition during speech | <200ms |
| Voice activity detection | Detect when user starts/stops speaking | <100ms |
| Simultaneous listening + generating | Generate response while user is still speaking | Real-time |
| Emotional speech synthesis | Generate speech with appropriate emotional tone | <300ms |
| Multi-turn conversation | Maintain context across voice turns | N/A |
| Interruption handling | Gracefully handle being interrupted mid-response | <150ms |
This full-duplex capability makes MiniCPM-o suitable for voice assistants, call center automation, and interactive voice applications.
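To make the latency budget concrete, here is a minimal sketch of the frame-level voice activity detection a full-duplex pipeline gates on. It is not MiniCPM-o's internal VAD – real systems use learned models – just an illustrative energy threshold over 30ms frames, which fits comfortably inside the sub-100ms budget in the table above:

```python
import numpy as np

def detect_voice_activity(audio: np.ndarray, sample_rate: int = 16000,
                          frame_ms: int = 30, threshold: float = 0.01) -> list[bool]:
    """Flag each 30ms frame as speech/non-speech by RMS energy.

    A full-duplex loop runs a check like this on the microphone stream so it
    can start transcribing while the user speaks and stop talking when
    interrupted. The fixed threshold is purely illustrative.
    """
    frame_len = sample_rate * frame_ms // 1000
    flags = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        flags.append(rms > threshold)
    return flags
```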
How does MiniCPM-o perform compared to GPT-4o?
MiniCPM-o achieves strong results on standard benchmarks, exceeding GPT-4o in some categories while trailing it in others.
| Benchmark | MiniCPM-o 2.6 | GPT-4o | Category |
|---|---|---|---|
| MMLU (language) | 72.3 | 88.7 | General knowledge |
| MMBench (single image) | 82.1 | 80.4 | Image understanding |
| MMMU (multi-discipline) | 57.5 | 69.1 | Advanced reasoning |
| OCRBench (text in images) | 82.8 | 76.3 | OCR quality |
| HallusionBench (visual QA) | 53.2 | 53.8 | Visual hallucination |
| MathVista (visual math) | 64.5 | 63.8 | Mathematical reasoning |
On single-image understanding (MMBench) and OCR tasks (OCRBench), MiniCPM-o 2.6 actually outperforms GPT-4o. On general knowledge (MMLU) and multi-discipline reasoning (MMMU), GPT-4o maintains a lead.
What hardware is required to run MiniCPM-o?
MiniCPM-o is designed to be accessible on consumer hardware, unlike many competing multimodal models.
```bash
# Install the runtime dependencies
pip install transformers torch
```

```python
# Load MiniCPM-o 2.6 (torch must be imported for the dtype argument)
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,       # the repo ships custom modeling code
    torch_dtype=torch.bfloat16,   # bf16 keeps the 8B model within 24GB VRAM
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-o-2_6", trust_remote_code=True)
```
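Once loaded, image inference follows the chat-style interface shown in the project's README; check the model card for the exact `chat()` signature, which may change between releases:

```python
# Ask a question about an image (pattern from the repo README;
# verify the exact chat() signature against the current model card)
from PIL import Image

image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "What is shown in this image?"]}]

answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```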
| Hardware | Model Size | Inference Speed | Notes |
|---|---|---|---|
| RTX 4090 (24GB VRAM) | 8B | 25-30 tokens/s | Full model on single GPU |
| RTX 3090 (24GB VRAM) | 8B | 20-25 tokens/s | Full model on single GPU |
| RTX 4060 (8GB VRAM) | 8B (4-bit) | 15-20 tokens/s | Requires quantization |
| Apple M2/M3 (16GB+) | 8B | 10-15 tokens/s | Via MLX or llama.cpp |
| CPU only | 8B (4-bit) | 3-5 tokens/s | Very slow, not recommended |
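For the 8GB-VRAM tier, one option is on-the-fly 4-bit quantization through bitsandbytes, sketched below. Whether bitsandbytes works cleanly with this model's custom code should be verified; OpenBMB also publishes pre-quantized int4 checkpoints for some variants, which may be the simpler route:

```python
# Load in 4-bit via bitsandbytes to fit ~8GB GPUs (verify compatibility with
# the model's custom code; OpenBMB also ships int4 checkpoints)
import torch
from transformers import AutoModel, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16, # compute in bf16 for quality
)

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
```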
Frequently Asked Questions
What is MiniCPM-o?
MiniCPM-o is an open-source multimodal LLM series from OpenBMB that processes vision, speech, and text simultaneously. It supports full-duplex voice interaction and outperforms GPT-4o on single-image understanding benchmarks.
What model versions are available?
The flagship MiniCPM-o 2.6 (8B parameters) comes in Vision+Text and Vision+Speech+Text variants. Earlier versions include MiniCPM-V 2.6 and MiniCPM-Llama3-V 2.5.
What full-duplex capabilities does MiniCPM-o offer?
Full-duplex voice interaction includes real-time ASR, voice activity detection, simultaneous listening and generating, emotional speech synthesis, multi-turn conversation, and interruption handling – with per-component latencies under 300ms.
How does MiniCPM-o compare to GPT-4o on benchmarks?
MiniCPM-o 2.6 outperforms GPT-4o on single-image understanding (MMBench: 82.1 vs 80.4) and OCR (OCRBench: 82.8 vs 76.3). GPT-4o leads on general knowledge (MMLU: 88.7 vs 72.3) and multi-discipline reasoning (MMMU: 69.1 vs 57.5).
What hardware is required to run MiniCPM-o?
The 8B model runs on a single RTX 4090/3090 with 24GB VRAM. With 4-bit quantization, it runs on 8GB GPUs. Apple Silicon users can use MLX for reasonable performance.
Further Reading
- MiniCPM-o GitHub Repository
- OpenBMB Official Site
- MiniCPM-o Technical Report
- GPT-4o System Card
- Multimodal LLMs: A Survey of Recent Advances
```mermaid
flowchart TB
    A[Input] --> B{Modality}
    B --> C[Image]
    B --> D[Speech]
    B --> E[Text]
    C --> F["Vision Encoder (SigLIP)"]
    D --> G["Speech Encoder (Whisper)"]
    E --> H[Text Tokenizer]
    F --> I[Projection Layer]
    G --> I
    H --> I
    I --> J[MiniCPM LLM Backbone]
    J --> K[Text Decoder]
    J --> L[Speech Decoder]
    K --> M[Text Output]
    L --> N[Speech Output]
```

```mermaid
graph TD
    subgraph BC["Benchmark Comparison"]
        A["GPT-4o Best: MMLU 88.7"]
        B["MiniCPM-o Best: MMBench 82.1"]
        C["Tie: HallusionBench ~53.5"]
    end
    subgraph HW["Hardware Requirements"]
        D["RTX 4090: Full model, 30 tok/s"]
        E["RTX 4060: 4-bit model, 20 tok/s"]
        F["Apple M3: MLX, 15 tok/s"]
    end
    subgraph UC["Use Cases"]
        G["Voice Assistants"]
        H["Document OCR"]
        I["Image Captioning"]
        J["Multimodal Chat"]
    end
```