PowerInfer: High-Speed LLM Inference on Consumer GPUs via CPU-GPU Hybrid Design
Running large language models locally has always been constrained by a hard wall: GPU memory. A 175-billion-parameter model in FP16 requires …
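To make that wall concrete, here is a back-of-the-envelope sketch of my own (the helper name `weight_memory_gb` is illustrative, not from PowerInfer): FP16 stores each parameter in 2 bytes, so the weights alone for a model at this scale far exceed the 24 GB found on a high-end consumer GPU such as an RTX 4090.

```python
# Rough estimate of GPU memory needed just to hold model weights.
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Weight storage in gigabytes; FP16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9

params = 175e9  # a 175-billion-parameter model
print(f"FP16 weights: {weight_memory_gb(params):.0f} GB")  # -> 350 GB
print("Consumer GPU (e.g. RTX 4090): 24 GB")               # an order of magnitude short
```

This ignores the KV cache and activations, which only widen the gap; it is the weights alone that already rule out naive single-GPU loading.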