
DSPy: Stanford's Framework for Algorithmically Optimizing AI Prompts

DSPy is a framework that algorithmically optimizes prompts and fine-tunes LLMs, replacing manual prompt engineering with programmatic optimization.

Prompt engineering has become an unexpected skill requirement in the AI era. Developers who wanted reliable LLM output learned to craft system prompts, structure few-shot examples, chain instructions, and iterate through trial and error. The process was manual, subjective, and brittle — a prompt that worked perfectly with GPT-4 might fail with Claude, and a prompt that worked last week might degrade after a model update.

DSPy, from the Stanford NLP group, takes a fundamentally different approach. Instead of asking developers to write prompts, it asks them to define the task. You specify what inputs the system receives, what outputs it should produce, and how to measure success. DSPy then treats the prompt as an optimization variable — searching through prompt strategies, few-shot examples, and instruction phrasings to find the combination that maximizes your metric.


How Does DSPy Replace Manual Prompt Engineering?

The core insight of DSPy is that prompt engineering is an optimization problem in disguise. Given a task (translate text, answer questions, extract entities), a set of labeled examples, and a success metric, the goal is to find the prompt configuration that maximizes performance.

DSPy makes this explicit through its module abstraction. A module defines the task boundary — what goes in and what comes out — without specifying how the LLM should be instructed. The module specification includes input/output field signatures, optional constraints on output format, and a reference to the metric that measures output quality.
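That contract can be sketched in plain Python. This is an illustrative stand-in for the module abstraction, not the actual DSPy API: the task boundary is declared as input/output fields plus a metric, and no prompt text appears anywhere.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative stand-in for the module abstraction (not the real DSPy API):
# the task is defined by its fields and its metric, never by prompt text.
@dataclass
class TaskModule:
    input_fields: list[str]
    output_fields: list[str]
    metric: Callable[[dict, dict], float]  # (gold, prediction) -> score

def exact_match(gold: dict, pred: dict) -> float:
    """Score 1.0 when the predicted answer matches the reference exactly."""
    return float(gold["answer"].strip().lower() == pred["answer"].strip().lower())

qa = TaskModule(input_fields=["question"], output_fields=["answer"], metric=exact_match)
```

Everything the optimizer needs — what goes in, what comes out, how to score it — is in this declaration; the prompt itself is left as a free variable.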

| Approach | Effort | Consistency | Transferability | Optimality |
|---|---|---|---|---|
| Manual prompt engineering | High (hours per prompt) | Low (varies by operator) | Low (rewrite per model) | Low (5-10 trials) |
| DSPy optimization | Medium (define task + metric) | High (algorithmic) | High (re-optimize per model) | High (100s-1000s of trials) |

The optimizer starts with a naive prompt (a simple instruction to perform the task) and iteratively refines it. Each iteration tries a new prompt variant — different phrasings, different few-shot example selections, different output format specifications — evaluates it against the validation set using the provided metric, and keeps the best performers. Over hundreds of iterations, the prompt converges to an optimal configuration.
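Stripped of detail, that refinement loop looks like the following sketch. The function names are hypothetical, and real optimizers generate variants on the fly rather than taking a fixed list, but the score-and-keep-the-best structure is the same.

```python
# Toy refinement loop: score each candidate prompt on the validation set
# with the user-supplied metric and keep the best. `llm` is a stand-in
# callable of the form llm(prompt, example_input) -> prediction.
def optimize_prompt(variants, valset, llm, metric):
    best, best_score = None, -1.0
    for prompt in variants:
        scores = [metric(gold, llm(prompt, x)) for x, gold in valset]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best, best_score = prompt, avg
    return best, best_score
```

The metric is the only notion of quality the loop has, which is why the later section stresses that metric design directly determines the quality of the result.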


What Optimization Strategies Does DSPy Support?

DSPy provides multiple optimization strategies, each suited to different task characteristics and resource budgets. The simplest strategies optimize the prompt text and few-shot examples without any fine-tuning. More advanced strategies can fine-tune the underlying model or use ensemble approaches.

BootstrapFewShot is the default and most accessible optimizer. It takes a few labeled examples and asks the LLM to generate additional examples, then selects the best few-shot demonstrations for each prompt. BootstrapFewShotWithRandomSearch extends this by randomly sampling different combinations of examples and instructions, evaluating each combination.
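The bootstrap idea reduces to: run the program on training inputs and keep only the traces the metric accepts as demonstrations. The sketch below is a simplification with illustrative names, not the DSPy implementation.

```python
# Simplified bootstrap sketch: keep input/prediction pairs that the metric
# scores as correct, up to a demo budget; these become few-shot examples.
def bootstrap_demos(trainset, program, metric, max_demos=4):
    demos = []
    for x, gold in trainset:
        pred = program(x)
        if metric(gold, pred) >= 1.0:
            demos.append((x, pred))
            if len(demos) >= max_demos:
                break
    return demos
```

Because only metric-validated traces survive, the selected demonstrations are guaranteed to show the model correct behavior, even if the unoptimized program is right only some of the time.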

| Optimizer | Approach | Best For | Resource Needs |
|---|---|---|---|
| BootstrapFewShot | Generates examples, selects best | Quick optimization | Low (few calls) |
| BootstrapFewShotWithRandomSearch | Random search over configs | Balanced optimization | Medium |
| MIPRO | Bayesian optimization | Maximum performance | High (many calls) |
| MIPROv2 | Enhanced Bayesian + fine-tuning | SOTA results | Very high |
| Ensemble | Multiple prompts voted | Reliability-critical | High |

MIPRO (Multiprompt Instruction PRoposal Optimizer) uses Bayesian optimization to efficiently search the prompt configuration space. It maintains a probabilistic model of how prompt changes affect performance and uses this model to select the most promising configurations to try. MIPROv2 extends this with automatic detection of whether the optimal configuration involves a better prompt, better examples, or fine-tuning the model itself.
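The "spend your evaluation budget on promising configurations" idea can be approximated with a much simpler stand-in: keep running score estimates per candidate and mostly re-evaluate the current best. This uses epsilon-greedy selection in place of a true Bayesian surrogate, and all names are illustrative.

```python
import random

# Epsilon-greedy stand-in for surrogate-guided search: exploit the
# best-looking candidate most of the time, explore randomly otherwise.
def guided_search(candidates, evaluate, budget=100, eps=0.5, seed=0):
    rng = random.Random(seed)
    stats = {c: [0.0, 0] for c in candidates}  # running [score_sum, count]

    def mean(c):
        score_sum, count = stats[c]
        return score_sum / count if count else 0.0

    for _ in range(budget):
        c = rng.choice(candidates) if rng.random() < eps else max(stats, key=mean)
        stats[c][0] += evaluate(c)
        stats[c][1] += 1
    return max(stats, key=mean)
```

A real Bayesian optimizer replaces the running means with a probabilistic surrogate model and an acquisition function, which lets it reason about uncertainty rather than just observed averages — but the budget-allocation intuition is the same.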


What Does a Real DSPy Workflow Look Like?

A typical DSPy workflow starts with installing the library and defining a language model client. You configure which LLM to use and the DSPy settings. Then you define modules for your task — DSPy provides built-in modules for common patterns like chain-of-thought reasoning, retrieval-augmented generation, and classification.

The critical step is defining the metric. This is the function DSPy uses to evaluate prompt quality during optimization. For a translation task, the metric might be BLEU score or human-rated accuracy. For a Q&A system, it might be exact match or F1 score against reference answers. The quality of the metric directly determines the quality of the optimized prompt.
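For example, a token-level F1 metric for a Q&A task — a common choice — might look like the following, hand-rolled here for clarity rather than taken from any particular library.

```python
# Token-level F1 between a reference answer and a prediction: harmonic mean
# of token precision and recall, with duplicate tokens matched at most once.
def f1_metric(gold: str, pred: str) -> float:
    gold_toks, pred_toks = gold.lower().split(), pred.lower().split()
    remaining, common = list(gold_toks), 0
    for tok in pred_toks:
        if tok in remaining:
            remaining.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_toks), common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

Note that this metric rewards partial overlap (`f1_metric("the cat sat", "the cat")` scores 0.8, not 0), which is often what you want during optimization: it gives the optimizer a gradient to climb instead of an all-or-nothing signal.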

After optimization, DSPy provides the compiled module with the optimal prompt embedded. You use this module in your application through the same API as the unoptimized version — the optimization happens once, and the result is a drop-in replacement.
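The drop-in property can be illustrated with a minimal stand-in (illustrative names, not the DSPy API): compilation returns a program with the same call interface, differing only in its embedded configuration.

```python
# Minimal stand-in for the compile step: the same call interface before and
# after optimization; only the embedded prompt configuration changes.
class Program:
    def __init__(self, prompt="Answer the question."):
        self.prompt = prompt

    def __call__(self, question: str) -> str:
        # A real program would call an LLM here; this just exposes the config.
        return f"[{self.prompt}] {question}"

def compile_program(program: Program, best_prompt: str) -> Program:
    # In DSPy, an optimizer would search for best_prompt; here it is given.
    return Program(prompt=best_prompt)

app = compile_program(Program(), "Answer concisely.")
```

Application code that called the unoptimized `Program` keeps working unchanged, which is what makes the optimized result a drop-in replacement.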


How Does DSPy Handle Different Model Families?

One of DSPy’s most practical features is automatic prompt adaptation across model families. A prompt optimized for GPT-4 might not work well with Llama 3 or Claude. DSPy addresses this by re-optimizing prompts when the underlying model changes.

The framework maintains a database of prompt characteristics across models, learning what instruction styles, example formats, and output specifications work best for each model family. When you switch models, DSPy can warm-start the optimization process with a prompt configuration that is known to work well for the new model, reducing the number of optimization iterations needed.
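Warm-starting reduces to seeding the candidate pool with a configuration known to work for the target model family, so the search begins in a good region rather than from scratch. The lookup table below is invented for illustration; it is not shipped by DSPy.

```python
# Hypothetical warm-start table: per-family starting prompts seed the
# search so optimization begins from a known-good region.
FAMILY_SEEDS = {
    "gpt-4": "Answer directly and concisely.",
    "llama-3": "Think step by step, then state the final answer explicitly.",
}

def seed_candidates(model_family: str, extra_variants: list[str]) -> list[str]:
    seed = FAMILY_SEEDS.get(model_family, "Answer the question.")
    return [seed] + extra_variants  # the seed is evaluated first
```

Because the seed is just another candidate, the optimizer is free to discard it if a generated variant scores higher — warm-starting changes where the search begins, not where it can end.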

This model-adaptation capability is increasingly important as organizations deploy across multiple model providers. A typical pattern uses DSPy-optimized prompts with GPT-4 for production alongside optimized prompts for local models (via Ollama) for development and testing, ensuring consistency across environments.

| Model Family | Optimal Instruction Style | Few-Shot Sensitivity |
|---|---|---|
| GPT-4 | Direct, concise | Medium |
| Claude 3 | Detailed, structured | Low |
| Llama 3 | Explicit, step-by-step | High |
| Mistral | Verbose, examples-heavy | High |
| Gemini | Structured, bullet-point | Medium |

FAQ

What is DSPy and what problem does it solve? DSPy (Declarative Self-improving Python) is a Stanford NLP framework that replaces manual prompt engineering with programmatic optimization. You define the task and metric, and DSPy automatically finds the optimal prompt strategy.

How does DSPy’s optimization process work? DSPy treats prompt construction as an optimization problem. It explores hundreds of prompt variants — different phrasings, few-shot selections, and instructions — using techniques like bootstrap few-shot and Bayesian search to maximize your metric.

Do I need to write prompts when using DSPy? No. In DSPy, you never write prompts directly. You define declarative modules with input/output signatures, and DSPy generates and optimizes prompts automatically.

Can DSPy work with any LLM provider? Yes. DSPy is model-agnostic and supports OpenAI, Anthropic, Google, Cohere, Ollama, and Hugging Face models with automatic prompt adaptation across model families.

How does DSPy compare to manual prompt engineering? In published benchmarks, DSPy-optimized programs have outperformed manually engineered prompts, in some reported cases by 10-30% accuracy, because the optimizer explores hundreds of prompt variants instead of the typical 5-10 manually evaluated options.

