Distributed computing is the hidden tax on AI and data-intensive applications. The logic of your application — the training loop, the batch processor, the inference pipeline — is straightforward. But distributing that logic across multiple machines introduces a cascade of complexity: task scheduling, data serialization, fault tolerance, resource management, and cluster coordination.
Ray was created at UC Berkeley’s RISELab to eliminate this tax. It provides a minimal set of distributed computing primitives — tasks for stateless remote execution, actors for stateful remote computation, and a distributed object store for data sharing — that are powerful enough to build any distributed application and simple enough that a single developer can use them productively. The Ray ecosystem extends these primitives into specialized libraries for AI workloads that have become the de facto standard for production AI infrastructure.
How Does Ray’s Programming Model Simplify Distributed Computing?
Ray’s core insight is that distributed computing primitives should look like normal Python function and class definitions. The @ray.remote decorator transforms any Python function into a remote task that can execute on any machine in the cluster. The same decorator on a class creates an actor — a remote object with state that persists across calls.
Tasks are the foundation. A @ray.remote function returns a future (ObjectRef) that represents the eventual result. Ray handles scheduling — finding an available machine, serializing arguments, transferring data, executing the function, and returning the result. Dependencies between tasks are expressed naturally by passing object references as arguments — Ray constructs a dependency graph and executes tasks in parallel where possible.
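As a minimal sketch (assuming a local `ray.init()` and toy functions; `load_shard` and `process` are illustrative names, not Ray APIs), tasks and their dependencies look like this:

```python
import ray

ray.init()  # starts a local Ray instance if no cluster address is given

@ray.remote
def load_shard(path):
    # Placeholder for real I/O; returns a small list of numbers.
    return list(range(10))

@ray.remote
def process(shard):
    # Runs in parallel with other process() calls.
    return sum(shard)

# Each .remote() call returns an ObjectRef (a future), not a result.
shard_refs = [load_shard.remote(f"s3://bucket/part-{i}") for i in range(4)]

# Passing ObjectRefs as arguments expresses dependencies; Ray schedules
# each process() only after the corresponding load_shard() finishes.
result_refs = [process.remote(ref) for ref in shard_refs]

# ray.get blocks until the results are ready and fetches them.
print(ray.get(result_refs))  # [45, 45, 45, 45]
```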
| Distributed Concept | Ray Primitive | Python Equivalent |
|---|---|---|
| Remote procedure call | @ray.remote on function | def |
| Remote stateful service | @ray.remote on class | class |
| Distributed object | ray.put() / ray.get() | Variable assignment |
| Parallel execution | ray.wait() on futures | asyncio.wait() |
| Resource management | @ray.remote(num_gpus=1) | N/A |
Actors provide stateful computation. An actor is instantiated on a specific machine and maintains its state across method calls. This is essential for model serving (models loaded in GPU memory), simulation state (game state across iterations), and coordination services (shared counters, rate limiters).
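A small sketch of a stateful actor, here an illustrative rate limiter (the class and its methods are hypothetical application code, not part of Ray):

```python
import ray

ray.init()

@ray.remote
class RateLimiter:
    """Toy stateful actor: state persists across remote method calls."""
    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def acquire(self):
        if self.count < self.limit:
            self.count += 1
            return True
        return False

# The actor is instantiated on one node in the cluster and stays there.
limiter = RateLimiter.remote(limit=2)

# Method calls also return futures; all callers share the same state.
print(ray.get([limiter.acquire.remote() for _ in range(3)]))
# [True, True, False]
```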
What Is the Ray AI Runtime Ecosystem?
Built on Ray Core, the Ray AI Runtime (Ray AIR) provides specialized libraries for the full AI development lifecycle. Each library handles a specific phase while sharing Ray’s distributed runtime — meaning data processing results flow directly into training, which flows into tuning, which flows into serving, without data movement or framework switching.
Ray Data provides distributed data loading and preprocessing. It reads from S3, GCS, HDFS, or local storage, transforms data with map/batch operations, and feeds directly into training. Ray Train handles distributed training with PyTorch, TensorFlow, and JAX, managing device placement, gradient synchronization, and checkpointing. RLlib provides production-ready reinforcement learning with built-in algorithms and distributed environment execution.
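A minimal Ray Data sketch, assuming a Parquet dataset at a hypothetical S3 path and a hypothetical `value` column; `read_parquet`, `map_batches`, and `iter_batches` are the core read, transform, and consume calls:

```python
import ray

# Hypothetical dataset location.
ds = ray.data.read_parquet("s3://example-bucket/training-data/")

def normalize(batch):
    # Batches arrive as a dict of NumPy arrays keyed by column name.
    batch["value"] = batch["value"] / batch["value"].max()
    return batch

# Transformation runs in parallel across the cluster.
ds = ds.map_batches(normalize)

# Stream batches directly into a training loop or a Ray Train worker.
for batch in ds.iter_batches(batch_size=1024):
    ...
```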
| Ray AIR Component | Purpose | What It Replaces |
|---|---|---|
| Ray Data | Distributed data processing | Spark DataFrame, Dask |
| Ray Train | Distributed training | Custom torch.distributed setup |
| Ray Tune | Hyperparameter optimization | Optuna, Hyperopt |
| RLlib | Reinforcement learning | OpenAI Baselines, Stable Baselines |
| Ray Serve | Model serving (CPU/GPU) | Custom FastAPI + Kubernetes |
| Ray Cluster | Cluster management | Kubernetes YAML, manual setup |
Ray Tune automates hyperparameter search across a cluster, supporting Bayesian optimization, population-based training, and asynchronous hyperband. RLlib provides implementations of popular RL algorithms (PPO, SAC, DQN, APEX) that scale from single-GPU to multi-node distributed training without code changes.
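As a hedged sketch of a Tune run (the objective function and search space are illustrative stand-ins for a real training loop):

```python
from ray import tune

def objective(config):
    # Illustrative objective: a real trainable would train and evaluate
    # a model here, then return its final metrics.
    return {"score": config["lr"] * config["batch_size"]}

tuner = tune.Tuner(
    objective,
    param_space={
        "lr": tune.loguniform(1e-5, 1e-1),
        "batch_size": tune.choice([32, 64, 128]),
    },
    tune_config=tune.TuneConfig(num_samples=20, metric="score", mode="max"),
)
results = tuner.fit()
print(results.get_best_result().config)
```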
How Does Ray Serve Handle Production Model Serving?
Ray Serve is the serving component of the Ray ecosystem, designed for production model deployment. It handles HTTP request routing, model loading, request batching, autoscaling, and deployment management. Serve integrates with Ray’s object store for efficient data passing and supports multi-model deployments with A/B testing and canary releases.
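A minimal Serve sketch, assuming Ray Serve is installed alongside Ray; the `Translator` deployment is an illustrative placeholder for a real model:

```python
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)
class Translator:
    def __init__(self):
        # A real deployment would load a model into (GPU) memory here.
        self.prefix = "echo: "

    async def __call__(self, request: Request) -> str:
        body = await request.json()
        return self.prefix + body.get("text", "")

# Binds the deployment and starts serving it over HTTP on the Ray cluster.
serve.run(Translator.bind())
```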
For LLM serving, Ray Serve integrates with vLLM and Hugging Face Text Generation Inference. It handles the complexities of continuous batching, KV cache management, and GPU memory allocation. Deployments can span multiple GPUs with automatic request routing and load balancing.
```mermaid
flowchart TD
A[HTTP Request] --> B[Ray Serve Router]
B --> C[Deployment 1<br/>GPT-style Model]
B --> D[Deployment 2<br/>Embedding Model]
B --> E[Deployment N<br/>Custom Model]
C --> F[Request Batching]
F --> G[vLLM Inference]
G --> H[Response]
D --> I[Batching]
I --> J[Embedding Inference]
J --> K[Response]
E --> L[Custom Logic]
L --> M[Response]
H --> N[Client]
K --> N
M --> N
```

Deployments can be updated without downtime. A canary deployment routes a percentage of traffic to a new model version while the current version handles the rest. If metrics degrade, traffic shifts back. If performance is satisfactory, the canary graduates to full deployment.
How Do You Deploy Ray in Production?
Ray supports multiple deployment modes. For development, ray start --head starts a single-node cluster on your laptop. For production, the Ray Cluster launcher handles cloud (AWS, GCP, Azure) and Kubernetes deployment. Auto-scaling cluster configurations specify minimum, maximum, and desired node counts, with Ray automatically provisioning and terminating nodes based on workload.
Kubernetes deployment uses the Ray Kubernetes Operator, which manages Ray clusters as custom resources. The operator handles cluster creation, scaling, upgrades, and failure recovery. Ray’s autoscaler adjusts cluster size based on pending tasks and available resources, scaling down to zero when idle.
| Deployment Feature | Ray | Kubernetes (bare) |
|---|---|---|
| Task scheduling | Ray scheduler | Kubernetes scheduler |
| GPU management | Automatic | Manual (node selectors) |
| Autoscaling | Workload-based | Metrics-based (HPA) |
| Fault tolerance | Built-in task retry | Pod restart |
| Multi-tenant | Per-job resource isolation | Namespaces + quotas |
Ray’s resource management ensures efficient GPU utilization. Tasks and actors declare resource requirements (num_gpus=1, memory=4GB), and Ray’s scheduler packs them onto nodes optimally. This is significantly more efficient than Kubernetes-style per-pod GPU allocation, as Ray can multiplex multiple tasks on a single GPU.
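For example, a hedged sketch of fractional GPU sharing (the `embed` task is illustrative placeholder work, not a Ray API):

```python
import ray

ray.init()

# Fractional GPU requests let several tasks share one physical GPU;
# Ray's scheduler packs four of these onto a single-GPU node.
@ray.remote(num_gpus=0.25)
def embed(batch):
    # Placeholder for GPU work.
    return len(batch)

refs = [embed.remote(list(range(100))) for _ in range(8)]
print(ray.get(refs))
```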
FAQ
What is Ray and what problems does it solve? Ray is an open-source unified framework for scaling Python and AI applications. It provides simple primitives (tasks, actors, objects) that abstract distributed systems complexity.
What are the key components of the Ray ecosystem? Ray Core, Ray Train, Ray Serve, RLlib, Ray Data, Ray Tune, and the Ray Cluster launcher for cloud and Kubernetes deployment.
How does Ray Serve handle LLM serving? Ray Serve provides HTTP routing, request batching, autoscaling, and model deployment with OpenAI-compatible APIs and vLLM integration for production LLM serving.
Can Ray run on a single machine? Yes. Ray runs on a single machine for development and scales to multi-node clusters for production with the same code.
Who uses Ray in production? OpenAI, Uber, Amazon, Shopify, and LinkedIn use Ray for production AI workloads, including GPT-4 training.