Twinny: Local LLM Inference for VS Code

Twinny is a VS Code extension for running local LLM inference with Ollama, providing AI code completion and chat without cloud dependencies.


The tension between cloud-dependent AI tools and developer privacy has become one of the defining debates in AI-assisted software development. Services like GitHub Copilot and Cursor offer impressive code completion capabilities, but they require sending your code to external servers. For developers who work on proprietary code, operate in regulated industries, or simply prefer not to share their work product with cloud services, that is a non-starter. The answer is local AI, and Twinny is one of the most accessible ways to bring it into your editor.

Twinny is a free, open-source VS Code extension that brings local LLM inference directly into your editor. It connects to Ollama – the popular local model runner – and provides AI code completion and chat assistance without any data ever leaving your machine. No subscription, no rate limits, no cloud dependency. Just a local model running on your hardware, integrated into your development workflow.

The experience is remarkably similar to GitHub Copilot. As you type, Twinny suggests completions in ghost text. You can press Tab to accept, or continue typing to refine. The inline chat panel lets you ask questions about your code, request refactoring, or generate new code – all running through a local model that costs nothing to operate after the initial download. The quality depends on which model you choose, and the community has converged on several excellent options that rival cloud-based solutions for most everyday coding tasks.

Feature Comparison

Twinny provides a comprehensive set of AI-assisted coding features through its VS Code integration:

| Feature | Twinny (Local) | GitHub Copilot | Cursor |
| --- | --- | --- | --- |
| Code Completions | Inline ghost text | Inline ghost text | Inline ghost text |
| Chat | Side panel + inline | Side panel | Built-in |
| Privacy | Fully local | Cloud-dependent | Cloud-dependent |
| Model Choice | Any Ollama model | OpenAI proprietary | GPT-4 / Claude |
| Cost | Free | $10/month | $20/month |
| Rate Limits | None | Yes (hourly) | Yes (per usage tier) |
| Offline | Yes (with downloaded models) | No | No |
| Custom Prompts | User-defined | Limited | Limited |

Twinny Workflow

A code completion request flows through the local stack as a simple pipeline: a keystroke in VS Code triggers Twinny, which extracts the surrounding context, constructs a prompt, sends it to Ollama for inference, filters the result, and surfaces the suggestion as ghost text.

Each keystroke triggers this pipeline, which typically completes in 200-500ms on a modern GPU with a 7B model. The Twinny extension handles context extraction, prompt construction, and result filtering, while Ollama manages the model lifecycle and inference.
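
For a concrete sense of the inference step, here is a minimal sketch of a fill-in-the-middle request against Ollama's local /api/generate endpoint. The sentinel tokens shown are CodeGemma's FIM format; other models use different tokens, and Twinny constructs the appropriate prompt for you:

```bash
# Illustrative request only: Twinny builds the equivalent call
# from your editor context automatically.
curl -s http://localhost:11434/api/generate -d '{
  "model": "codegemma",
  "prompt": "<|fim_prefix|>def add(a, b):\n    <|fim_suffix|>\n<|fim_middle|>",
  "stream": false,
  "options": { "num_predict": 64, "temperature": 0.2 }
}'
```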

The quality of Twinny’s output depends heavily on model selection:

| Model | Parameters | Code Quality | Speed | VRAM | Best For |
| --- | --- | --- | --- | --- | --- |
| CodeGemma | 7B | Excellent | Fast | 6GB | General code completion |
| DeepSeek-Coder | 6.7B | Excellent | Fast | 6GB | Complex code generation |
| StarCoder2 | 7B | Very Good | Fast | 6GB | Multi-language support |
| Qwen 2.5 Coder | 7B | Very Good | Fast | 6GB | Chinese + English code |
| Llama 3.1 | 8B | Good | Moderate | 8GB | Chat + general coding |
| Qwen 2.5 Coder | 14B | Excellent | Moderate | 12GB | High-quality completions |
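
Any of these can be pulled from the Ollama model library with a single command. The tags below match the library's naming at the time of writing:

```bash
# Pull alternative code models from the Ollama library
ollama pull deepseek-coder:6.7b
ollama pull starcoder2:7b
ollama pull qwen2.5-coder:7b

# List the models installed locally
ollama list
```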

Getting Started

To start using Twinny, first install Ollama and download a model, then install the Twinny extension in VS Code:

```bash
# Install Ollama and download a code model
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull codegemma

# Then search for "Twinny" in the VS Code Extensions panel
```
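
Before connecting the extension, it is worth confirming that Ollama is serving on its default port (11434), since that is the local endpoint Twinny talks to:

```bash
# Returns the locally available models as JSON
curl -s http://localhost:11434/api/tags
```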

The Twinny GitHub repository provides comprehensive setup guides, configuration tips, and community discussions about model preferences and performance tuning.

FAQ

What is Twinny?

Twinny is a free, open-source VS Code extension that provides AI code completion and chat using locally running LLMs via Ollama. It is a privacy-first alternative to GitHub Copilot with no cloud dependencies, no data leaving your machine, and no subscription fees.

How does Twinny compare to GitHub Copilot?

Twinny is fully local while Copilot is cloud-based. Twinny completions depend on the quality of your local model (Llama 3, Qwen, CodeGemma, etc.) rather than Copilot’s proprietary OpenAI model. Twinny offers unlimited completions with no subscription, while Copilot has a $10/month fee and rate limits.

What models work best with Twinny?

For code completion, specialized code models like CodeGemma, StarCoder2, and DeepSeek-Coder perform best. For chat-based assistance, general-purpose models like Llama 3 and Qwen 2.5 work well. Models with 7B parameters typically strike the best balance of quality and speed.
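
As a rough illustration of the chat side, this is the kind of request Ollama serves locally for a general-purpose model (Twinny's own request format may differ):

```bash
curl -s http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "user", "content": "Write a Python function that reverses a string."}
  ],
  "stream": false
}'
```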

What hardware do I need for Twinny?

For 7B parameter models: 8GB RAM (16GB+ recommended), any modern CPU, and a GPU with 6GB+ VRAM (Apple Silicon works well). For larger models, more memory and GPU power are needed. Even without a GPU, reasonable completions are possible with 4-bit quantized models on CPU.
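
One quick way to check whether a loaded model fits in VRAM or is spilling over to the CPU:

```bash
# Shows loaded models and the GPU/CPU split for each
ollama ps
```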

Is Twinny really free?

Yes, Twinny is completely free and open source under the MIT license. There are no subscriptions, usage limits, or data collection. The only costs are the electricity to run your local model and the storage space for model weights.

