Deploying large language models in production is a fundamentally different challenge from training them. Training requires massive clusters and weeks of compute, but can tolerate batch processing and variable throughput. Production inference requires consistent sub-second latency, elastic scaling to handle traffic spikes, multi-model management across different hardware configurations, and observability into every request. The gap between a trained model and a production-grade serving infrastructure is enormous.
Xorbits Inference (Xinference) fills this gap with an open-source platform purpose-built for scalable LLM serving. Originally developed as part of the Xorbits ecosystem for distributed data processing, Xinference has grown into one of the most comprehensive open-source model serving platforms available. It supports a wide range of model architectures – from LLMs and embedding models to vision-language and audio models – and provides the operational tooling needed to run them reliably at scale.
What sets Xinference apart from alternatives like vLLM, TGI, and Ollama is its breadth of model support and operational features. While vLLM focuses on high-throughput LLM serving and Ollama targets local development, Xinference aims to be the one platform that covers the full spectrum: from a single developer running a model on a laptop to a production cluster serving millions of requests across dozens of model variants.
Supported Model Categories
Xinference supports an impressively broad range of model types, each with optimized serving configurations (a launch sketch follows the table):
| Model Type | Examples | Use Case |
|---|---|---|
| LLMs | LLaMA 3, Qwen 2.5, Mistral, Phi-4, DeepSeek | Chat, code generation, text completion |
| Embedding | BGE, E5, Instructor, Jina | Vector search, RAG pipelines |
| Reranker | BGE Reranker, Cohere Rerank | Search result reordering |
| Image Generation | Stable Diffusion 3, FLUX.1, SDXL | Image creation from text |
| Audio | Whisper, Bark, ChatTTS | Speech-to-text, text-to-speech |
| Vision-Language | LLaVA, Qwen-VL, InternVL | Image captioning, visual QA |
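What makes this breadth manageable in practice is that every category is launched through the same client call. A minimal sketch, assuming a server running on the default port and model names that exist in your version's built-in registry (both names here are placeholders):

```python
# Sketch: one launch API across model categories. model_type selects the
# category; the remaining arguments come from the model's registry entry.
from xinference.client import Client

client = Client("http://localhost:9997")

# An LLM for chat or completion workloads.
llm_uid = client.launch_model(
    model_name="qwen2.5-instruct",   # placeholder registry model
    model_type="LLM",
    model_size_in_billions=7,
)

# An embedding model for a RAG pipeline on the same cluster.
emb_uid = client.launch_model(
    model_name="bge-large-en-v1.5",  # placeholder registry model
    model_type="embedding",
)

# Handles expose category-appropriate methods, e.g. embeddings:
print(client.get_model(emb_uid).create_embedding("hello world"))
```

Each launch returns a model UID, which is how that instance is addressed for inference and management from then on.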
Multi-Model Serving Architecture
The following diagram shows how Xinference manages multiple models across a cluster of GPU nodes:
```mermaid
flowchart TD
    Client[Client Applications] --> Gateway[Xinference API Gateway]
    Gateway --> Router[Model Router]
    subgraph Cluster[GPU Cluster]
        Router --> M1[Model Instance: LLaMA 3<br>GPU Node 1<br>4-bit quantized]
        Router --> M2[Model Instance: BGE Embeddings<br>GPU Node 2<br>Batch size: 32]
        Router --> M3[Model Instance: Whisper<br>GPU Node 3<br>FP16]
        Router --> M4[Model Instance: Stable Diffusion<br>GPU Node 4<br>3 replicas]
    end
    M1 --> LB1[Load Balancer]
    M2 --> LB2[Load Balancer]
    M3 --> LB3[Load Balancer]
    M4 --> LB4[Load Balancer]
    subgraph Monitoring[Observability Stack]
        LB1 --> Metrics[Metrics Collector]
        LB2 --> Metrics
        LB3 --> Metrics
        LB4 --> Metrics
        Metrics --> Dashboard[Grafana Dashboard]
        Metrics --> Alerts[Alert Manager]
    end
```

The gateway handles request routing, the model router determines which model instance should handle each request, and each model instance can be independently scaled, updated, or replaced without affecting the others. This architecture is critical for production deployments where different teams may own different models with different traffic patterns.
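Because the gateway exposes an OpenAI-compatible surface, routing is driven entirely by the `model` field of each request. A minimal sketch, assuming the `openai` Python package and models with these (placeholder) names already launched on the cluster:

```python
# Sketch: one gateway, many models. The `model` field determines which backing
# instance the Xinference router dispatches to. A local deployment typically
# ignores the API key, but the SDK requires one to be set.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

# Chat request routed to the LLM instance.
chat = client.chat.completions.create(
    model="llama-3-instruct",  # placeholder name
    messages=[{"role": "user", "content": "Summarize Xinference in one line."}],
)
print(chat.choices[0].message.content)

# Embedding request routed, through the same endpoint, to the embedding instance.
emb = client.embeddings.create(
    model="bge-large-en-v1.5",  # placeholder name
    input=["vector search query"],
)
print(len(emb.data[0].embedding))
```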
Scaling and Performance
Xinference provides multiple scaling dimensions to handle production traffic (a replica-based sketch follows the table):
| Strategy | Mechanism | Time to Scale | Best For |
|---|---|---|---|
| Vertical | Increase GPU memory / cores per instance | Minutes | Single large model optimization |
| Horizontal | Add more model replicas | Seconds | Traffic spikes, high concurrency |
| Batching | Group concurrent requests to the same model on one GPU | Milliseconds | High-throughput, low-variety workloads |
| Model Parallel | Shard a single model across GPUs | Hours | Models too large for one GPU |
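Horizontal scaling is the lever most deployments reach for first. A minimal sketch, assuming the Python client's `replica` launch parameter (present in recent Xinference releases) and enough free GPUs to place each replica:

```python
# Sketch: replica-based horizontal scaling at launch time. Xinference
# load-balances requests to this model UID across the replicas; the replica
# count here is illustrative.
from xinference.client import Client

client = Client("http://localhost:9997")

uid = client.launch_model(
    model_name="qwen2.5-instruct",  # placeholder registry model
    model_type="LLM",
    model_size_in_billions=7,
    replica=3,  # three identical instances behind one model UID
)
print(client.list_models())  # shows the launched model and its status
```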
Getting Started
Xinference can be installed via pip and started in minutes:
```bash
pip install "xinference[all]"
xinference-local
```
This starts the Xinference service on port 9997, providing a web UI for model management and an OpenAI-compatible API endpoint. Visit the Xorbits Inference GitHub repository for installation guides, model configuration examples, and deployment best practices.
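A quick way to confirm the service is up is to hit the OpenAI-compatible listing endpoint. A minimal check, assuming the default port and the `requests` package:

```python
# Sketch: sanity-check a local Xinference deployment. The /v1/models route
# mirrors the OpenAI API and lists the models currently launched.
import requests

resp = requests.get("http://localhost:9997/v1/models", timeout=5)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model.get("id"))
```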
The Xinference documentation portal provides comprehensive guides for Kubernetes deployment, GPU configuration, quantization settings, and API integration.
FAQ
What is Xorbits Inference?
Xorbits Inference (Xinference) is an open-source platform for deploying, serving, and managing large language models and other AI models in production. It provides a unified API for diverse model types, automatic scaling, and comprehensive monitoring.
What model types does Xorbits Inference support?
Xinference supports LLMs (including LLaMA, Qwen, Mistral, Phi, and others), embedding models, reranker models, image generation models (Stable Diffusion), audio models (Whisper, Bark), and vision-language models (LLaVA, Qwen-VL).
How does Xorbits Inference handle scaling?
Xinference supports horizontal scaling across multiple GPU nodes. New model replicas can be launched on demand, and the built-in load balancer distributes requests across available replicas. It integrates with Kubernetes for automatic scaling based on metrics like queue depth and GPU utilization.
Does Xorbits Inference support quantization?
Yes. Xinference supports multiple quantization methods including GPTQ, AWQ, GGUF, and bitsandbytes at 4-bit and 8-bit precision. This allows running larger models on limited GPU hardware with minimal quality degradation.
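In practice, quantization is selected at launch time. A hedged sketch, assuming the Python client's `model_format` and `quantization` parameters; the accepted values depend on which formats the registry lists for the chosen model:

```python
# Sketch: launching a GPTQ-quantized model so a larger checkpoint fits on a
# smaller GPU. The format/quantization values are illustrative and must match
# an entry in the model's registry specification.
from xinference.client import Client

client = Client("http://localhost:9997")

uid = client.launch_model(
    model_name="qwen2.5-instruct",   # placeholder registry model
    model_type="LLM",
    model_size_in_billions=14,
    model_format="gptq",             # could also be awq, ggufv2, etc.
    quantization="Int4",             # 4-bit weights
)
```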
What APIs does Xorbits Inference provide?
Xinference provides OpenAI-compatible API endpoints for LLMs (chat completions, completions, embeddings), REST APIs for model management, a Python SDK for programmatic control, and a web UI for interactive model exploration and management.
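The management surface is separate from the inference surface, so operational tasks can be scripted independently of serving traffic. A small sketch, assuming the Python SDK's `list_models` and `terminate_model` calls:

```python
# Sketch: programmatic model lifecycle management via the Python SDK.
# list_models() reports what is running; terminate_model() frees its GPUs.
from xinference.client import Client

client = Client("http://localhost:9997")

running = client.list_models()  # mapping of model UID -> description
for uid in running:
    print(uid)

# Retire one instance without touching the others (UID is illustrative).
client.terminate_model(model_uid="qwen2.5-instruct-0")
```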
Further Reading
- Xorbits Inference GitHub Repository – Source code, releases, and community contributions
- Xinference Documentation – Installation guides, API reference, and deployment tutorials
- vLLM: High-Throughput LLM Serving – Alternative LLM serving engine focused on throughput
- Ollama: Local LLM Runner – Lightweight local model runner for development