MiniCPM-o: Open-Source Multimodal LLM for Vision, Speech, and Text

MiniCPM-o is a series of open-source multimodal LLMs capable of processing vision, speech, and text simultaneously, outperforming GPT-4o on single-image understanding benchmarks.


Multimodal AI models that can simultaneously process vision, speech, and text represent the cutting edge of artificial intelligence. OpenAI’s GPT-4o demonstrated the potential of this approach, but its closed nature has left the open-source community racing to catch up. MiniCPM-o, developed by OpenBMB (an offshoot of Tsinghua University’s NLP lab), has achieved a remarkable milestone: it outperforms GPT-4o on single-image understanding benchmarks while matching or exceeding it on speech tasks – all in an open-source package.

The project at github.com/OpenBMB/MiniCPM-o represents a series of multimodal LLMs that extend the MiniCPM family’s impressive performance-to-size ratio into the multimodal domain. MiniCPM-o supports full-duplex voice interaction – meaning it can listen and speak simultaneously, like a natural conversation – along with image understanding, optical character recognition, and multi-turn dialogue capabilities.

What makes MiniCPM-o particularly remarkable is the efficiency of its architecture. While GPT-4o likely requires enormous computational resources, MiniCPM-o achieves competitive or superior results on key benchmarks with a model that can run on consumer hardware. This democratization of multimodal AI capabilities has made it one of the most important open-source AI releases of recent years.

What is MiniCPM-o?

MiniCPM-o is a series of open-source multimodal LLMs that process vision, speech, and text simultaneously. Developed by OpenBMB, it builds on the MiniCPM language model family and extends it with visual and speech understanding capabilities. It supports full-duplex voice interaction, single and multi-image understanding, and achieves state-of-the-art results on several key benchmarks.

What model versions are available?

MiniCPM-o comes in several variants optimized for different use cases.

| Model | Parameters | Modalities | Key Strength |
|---|---|---|---|
| MiniCPM-o 2.6 | 8B | Vision + Text | Best image understanding in class |
| MiniCPM-o 2.6 (Speech) | 8B | Vision + Speech + Text | Full-duplex voice interaction |
| MiniCPM-V 2.6 | 8B | Vision + Text | Pure VLM, lower resource usage |
| MiniCPM-Llama3-V 2.5 | 9B | Vision + Text | LLaMA-based, broader ecosystem |

The 2.6 release is the current flagship, introducing speech capabilities that were absent in earlier versions.
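
For reference, the variants above correspond to the following repository IDs under the openbmb organization on Hugging Face. Naming can change between releases, so it is worth re-checking on the hub before loading:

# Hugging Face repo IDs for the variants above (openbmb org; verify on the hub).
MODEL_REPOS = {
    "MiniCPM-o 2.6": "openbmb/MiniCPM-o-2_6",               # vision + speech + text
    "MiniCPM-V 2.6": "openbmb/MiniCPM-V-2_6",               # vision + text
    "MiniCPM-Llama3-V 2.5": "openbmb/MiniCPM-Llama3-V-2_5"  # LLaMA-3 based
}

print(MODEL_REPOS["MiniCPM-o 2.6"])  # -> openbmb/MiniCPM-o-2_6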

What full-duplex capabilities does MiniCPM-o offer?

Full-duplex voice interaction is MiniCPM-o’s standout feature – it can listen and speak simultaneously, like a human conversation.

| Capability | Description | Latency |
|---|---|---|
| Real-time ASR | Automatic speech recognition during speech | <200ms |
| Voice activity detection | Detect when user starts/stops speaking | <100ms |
| Simultaneous listening + generating | Generate response while user is still speaking | Real-time |
| Emotional speech synthesis | Generate speech with appropriate emotional tone | <300ms |
| Multi-turn conversation | Maintain context across voice turns | N/A |
| Interruption handling | Gracefully handle being interrupted mid-response | <150ms |

This full-duplex capability makes MiniCPM-o suitable for voice assistants, call center automation, and interactive voice applications.
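
To make the interaction pattern concrete, here is a minimal structural sketch of a full-duplex loop in Python. It is not MiniCPM-o's actual streaming API: `transcribe_chunk`, `generate_reply`, and `play_audio` are hypothetical stand-ins for whatever ASR, generation, and TTS calls a deployment exposes. What it shows is the concurrency shape: playback runs as a background task, and new user speech cancels any in-flight response (the interruption handling listed above).

import asyncio

# Hypothetical helpers: stand-ins for real streaming ASR / LLM / TTS calls.
async def transcribe_chunk(chunk: bytes) -> str:
    await asyncio.sleep(0.05)              # simulated ASR latency
    return chunk.decode(errors="ignore")

async def generate_reply(text: str) -> str:
    await asyncio.sleep(0.2)               # simulated model generation
    return f"echo: {text}"

async def play_audio(reply: str) -> None:
    print(f"[speaking] {reply}")
    await asyncio.sleep(1.0)               # simulated playback time

async def full_duplex_loop(mic_chunks):
    speaking = None
    for chunk in mic_chunks:
        text = await transcribe_chunk(chunk)   # listening never stops
        if speaking and not speaking.done():
            speaking.cancel()                  # barge-in: drop the stale reply
        reply = await generate_reply(text)
        speaking = asyncio.create_task(play_audio(reply))
    if speaking:
        try:
            await speaking                     # let the final reply finish
        except asyncio.CancelledError:
            pass

asyncio.run(full_duplex_loop([b"hello there", b"wait, one more thing"]))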

How does MiniCPM-o perform compared to GPT-4o?

MiniCPM-o achieves remarkable results on standard benchmarks, often matching or exceeding GPT-4o.

| Benchmark | MiniCPM-o 2.6 | GPT-4o | Category |
|---|---|---|---|
| MMLU (language) | 72.3 | 88.7 | General knowledge |
| MMBench (single image) | 82.1 | 80.4 | Image understanding |
| MMMU (multi-discipline) | 57.5 | 69.1 | Advanced reasoning |
| OCRBench (text in images) | 82.8 | 76.3 | OCR quality |
| HallusionBench (visual QA) | 53.2 | 53.8 | Visual hallucination |
| MathVista (visual math) | 64.5 | 63.8 | Mathematical reasoning |

On single-image understanding (MMBench) and OCR tasks (OCRBench), MiniCPM-o 2.6 actually outperforms GPT-4o. On general knowledge (MMLU) and multi-discipline reasoning (MMMU), GPT-4o maintains a lead.

What hardware is required to run MiniCPM-o?

MiniCPM-o is designed to be accessible on consumer hardware, unlike many competing multimodal models.

# Install with Transformers
pip install transformers torch

# Load MiniCPM-o 2.6
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-o-2_6", trust_remote_code=True)
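
With the model loaded, single-image chat looks roughly like the sketch below. It assumes the `chat` helper that the repository's remote code attaches to the model object; the message format follows the 2.6-series model cards, but signatures can change between releases, so confirm against the official example before relying on it.

# Sketch of single-image inference via the repo-provided chat helper.
# The msgs format follows the 2.6-series model cards; verify for your release.
from PIL import Image

image = Image.open("example.jpg").convert("RGB")   # any local test image
msgs = [{"role": "user", "content": [image, "What text appears in this image?"]}]

answer = model.chat(
    image=None,          # 2.6-series models take the image inside msgs
    msgs=msgs,
    tokenizer=tokenizer,
)
print(answer)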

| Hardware | Model Size | Inference Speed | Notes |
|---|---|---|---|
| RTX 4090 (24GB VRAM) | 8B | 25-30 tokens/s | Full model on single GPU |
| RTX 3090 (24GB VRAM) | 8B | 20-25 tokens/s | Full model on single GPU |
| RTX 4060 (8GB VRAM) | 8B (4-bit) | 15-20 tokens/s | Requires quantization |
| Apple M2/M3 (16GB+) | 8B | 10-15 tokens/s | Via MLX or llama.cpp |
| CPU only | 8B (4-bit) | 3-5 tokens/s | Very slow, not recommended |
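
The 4-bit rows above assume weight quantization. A minimal sketch of the generic Transformers + bitsandbytes path follows; whether MiniCPM-o's custom remote code is fully compatible with on-the-fly bitsandbytes quantization should be verified against the project README, and check the openbmb hub page for an official pre-quantized int4 build of your variant, which is usually the easier route.

# Sketch: generic 4-bit loading via bitsandbytes. Compatibility with
# MiniCPM-o's custom remote code should be verified before production use.
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # NF4 weight quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for quality
)

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-o-2_6", trust_remote_code=True
)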

Frequently Asked Questions

What is MiniCPM-o?

MiniCPM-o is an open-source multimodal LLM series from OpenBMB that processes vision, speech, and text simultaneously. It supports full-duplex voice interaction and outperforms GPT-4o on single-image understanding benchmarks.

What model versions are available?

The flagship MiniCPM-o 2.6 (8B parameters) comes in Vision+Text and Vision+Speech+Text variants. Earlier versions include MiniCPM-V 2.6 and MiniCPM-Llama3-V 2.5.

What full-duplex capabilities does MiniCPM-o offer?

Full-duplex voice interaction includes real-time ASR, voice activity detection, simultaneous listening and generating, emotional speech synthesis, multi-turn conversation, and interruption handling – with per-component latencies under 300ms where measured.

How does MiniCPM-o compare to GPT-4o on benchmarks?

MiniCPM-o 2.6 outperforms GPT-4o on single-image understanding (MMBench: 82.1 vs 80.4) and OCR (OCRBench: 82.8 vs 76.3). GPT-4o leads on general knowledge (MMLU: 88.7 vs 72.3) and multi-discipline reasoning (MMMU: 69.1 vs 57.5).

What hardware is required to run MiniCPM-o?

The 8B model runs on a single RTX 4090/3090 with 24GB VRAM. With 4-bit quantization, it runs on 8GB GPUs. Apple Silicon users can use MLX for reasonable performance.
