Xorbits Inference: Scalable LLM Serving Platform

Xorbits Inference is a scalable LLM serving platform for deploying and managing large language models in production with multi-model support.

Deploying large language models in production is a fundamentally different challenge from training them. Training requires massive clusters and weeks of compute, but can tolerate batch processing and variable throughput. Production inference requires consistent sub-second latency, elastic scaling to handle traffic spikes, multi-model management across different hardware configurations, and observability into every request. The gap between a trained model and a production-grade serving infrastructure is enormous.

Xorbits Inference (Xinference) fills this gap with an open-source platform purpose-built for scalable LLM serving. Originally developed as part of the Xorbits ecosystem for distributed data processing, Xinference has grown into one of the most comprehensive open-source model serving platforms available. It supports a wide range of model architectures – from LLMs and embedding models to vision-language and audio models – and provides the operational tooling needed to run them reliably at scale.

What sets Xinference apart from alternatives like vLLM, TGI, and Ollama is its breadth of model support and operational features. While vLLM focuses on high-throughput LLM serving and Ollama targets local development, Xinference aims to be the one platform that covers the full spectrum: from a single developer running a model on a laptop to a production cluster serving millions of requests across dozens of model variants.

Supported Model Categories

Xinference supports an impressively broad range of model types, each with optimized serving configurations:

Model Type       | Examples                                    | Use Case
LLMs             | LLaMA 3, Qwen 2.5, Mistral, Phi-4, DeepSeek | Chat, code generation, text completion
Embedding        | BGE, E5, Instructor, Jina                   | Vector search, RAG pipelines
Reranker         | BGE Reranker, Cohere Rerank                 | Search result reordering
Image Generation | Stable Diffusion 3, FLUX                    | Image creation from text
Audio            | Whisper, Bark, ChatTTS                      | Speech-to-text, text-to-speech
Vision-Language  | LLaVA, Qwen-VL, InternVL                    | Image captioning, visual QA
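
As a sketch of how these categories are exposed in practice, the snippet below launches an LLM and an embedding model side by side through the Python client. The model names and keyword arguments here are illustrative; exact parameters (for example, recent versions also expect a model_engine for LLMs) vary by Xinference version and by model.

from xinference.client import Client

# Connect to a running Xinference endpoint (default local port is 9997).
client = Client("http://localhost:9997")

# Launch an LLM for chat and an embedding model for retrieval.
# Model names are illustrative; browse available models in the web UI.
llm_uid = client.launch_model(model_name="qwen2.5-instruct", model_type="LLM")
emb_uid = client.launch_model(model_name="bge-large-en-v1.5", model_type="embedding")
print(llm_uid, emb_uid)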

Multi-Model Serving Architecture

Xinference manages multiple models across a cluster of GPU nodes through three cooperating layers: a gateway, a model router, and the model instances themselves.

The gateway handles request routing, the model router determines which model instance should handle each request, and each model instance can be independently scaled, updated, or replaced without affecting the others. This architecture is critical for production deployments where different teams may own different models with different traffic patterns.
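
In practice this routing is transparent to clients: a request names the model it wants, and the gateway forwards it to a matching instance. Below is a minimal sketch against the OpenAI-compatible endpoint, assuming a model has been launched under the UID qwen2.5-instruct.

import requests

# The "model" field is what the router uses to select a model instance.
resp = requests.post(
    "http://localhost:9997/v1/chat/completions",
    json={
        "model": "qwen2.5-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])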

Scaling and Performance

Xinference provides multiple scaling dimensions to handle production traffic:

Strategy       | Mechanism                                        | Time to Scale | Best For
Vertical       | Increase GPU memory / cores per instance         | Minutes       | Single large model optimization
Horizontal     | Add more model replicas                          | Seconds       | Traffic spikes, high concurrency
Batching       | Batch concurrent requests to one model on a GPU  | Milliseconds  | High-throughput, low-variety workloads
Model Parallel | Shard a single model across GPUs                 | Hours         | Models too large for one GPU
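
Horizontal scaling is the most common of these in day-to-day operation. As a hedged sketch, Xinference exposes a replica count at launch time and load-balances requests across the replicas; the parameter name and model name below are illustrative and may differ across versions.

from xinference.client import Client

client = Client("http://localhost:9997")

# Launch two replicas of the same model; incoming requests are
# distributed across them by the built-in load balancer.
model_uid = client.launch_model(
    model_name="qwen2.5-instruct",
    model_type="LLM",
    replica=2,
)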

Getting Started

Xinference can be installed via pip and started in minutes:

pip install "xinference[all]"
xinference-local

This starts the Xinference service on port 9997, providing a web UI for model management and an OpenAI-compatible API endpoint. Visit the Xorbits Inference GitHub repository for installation guides, model configuration examples, and deployment best practices.
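
Because the endpoint is OpenAI-compatible, existing OpenAI client code can be pointed at it by overriding the base URL. A minimal sketch with the official openai Python package follows; the model value must match the UID of a model you have launched, and the names here are illustrative.

from openai import OpenAI

# A local Xinference endpoint needs no real API key; pass a placeholder.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-used")

response = client.chat.completions.create(
    model="qwen2.5-instruct",  # UID of a model launched in Xinference
    messages=[{"role": "user", "content": "Summarize what Xinference does."}],
)
print(response.choices[0].message.content)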

The Xinference documentation portal provides comprehensive guides for Kubernetes deployment, GPU configuration, quantization settings, and API integration.

FAQ

What is Xorbits Inference?

Xorbits Inference (Xinference) is an open-source platform for deploying, serving, and managing large language models and other AI models in production. It provides a unified API for diverse model types, automatic scaling, and comprehensive monitoring.

What model types does Xorbits Inference support?

Xinference supports LLMs (including LLaMA, Qwen, Mistral, Phi, and others), embedding models, reranker models, image generation models (Stable Diffusion), audio models (Whisper, Bark), and vision-language models (LLaVA, Qwen-VL).

How does Xorbits Inference handle scaling?

Xinference supports horizontal scaling across multiple GPU nodes. New model replicas can be launched on demand, and the built-in load balancer distributes requests across available replicas. It integrates with Kubernetes for automatic scaling based on metrics like queue depth and GPU utilization.

Does Xorbits Inference support quantization?

Yes. Xinference supports multiple quantization methods including GPTQ, AWQ, GGUF, and bitsandbytes at 4-bit and 8-bit precision. This allows running larger models on limited GPU hardware with minimal quality degradation.
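
As an illustration, quantization is typically selected when the model is launched. The format and quantization labels below are illustrative assumptions; what is actually available depends on the builds published for each model.

from xinference.client import Client

client = Client("http://localhost:9997")

# Launch a 4-bit AWQ build of a model (illustrative values; check the
# model's available formats and quantizations in the web UI or docs).
model_uid = client.launch_model(
    model_name="qwen2.5-instruct",
    model_type="LLM",
    model_format="awq",
    quantization="Int4",
)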

What APIs does Xorbits Inference provide?

Xinference provides OpenAI-compatible API endpoints for LLMs (chat completions, completions, embeddings), REST APIs for model management, a Python SDK for programmatic control, and a web UI for interactive model exploration and management.
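
For programmatic control, the Python SDK wraps the same REST APIs. Below is a hedged sketch of obtaining a handle to an already-launched embedding model and embedding a string; the model UID is illustrative and method names may differ slightly across SDK versions.

from xinference.client import Client

client = Client("http://localhost:9997")

# get_model returns a typed handle: embedding handles expose
# create_embedding, while LLM handles expose chat/generate.
model = client.get_model("bge-large-en-v1.5")  # UID from launch_model
result = model.create_embedding("vector search with Xinference")
# The response mirrors the OpenAI embeddings format.
print(len(result["data"][0]["embedding"]))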

