Vision-language AI – models that understand both images and text – is one of the most rapidly advancing areas of artificial intelligence. Salesforce’s LAVIS (Library for Vision-Language Intelligence) provides a unified framework for training, evaluating, and deploying a wide range of vision-language models including BLIP, BLIP-2, InstructBLIP, and ALBEF.
LAVIS is designed for both researchers and practitioners. Researchers get clean implementations of state-of-the-art models with reproducible benchmarks, while practitioners get a streamlined API for applying these models to real-world tasks like image captioning, visual question answering, and cross-modal retrieval.
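As a taste of that streamlined API, the sketch below loads a BLIP captioning checkpoint and captions a local image. The model name ("blip_caption") and type ("base_coco") follow LAVIS's model zoo naming, but the image path and the printed caption are hypothetical; verify checkpoint names against the model zoo for your installed release.

```python
# Minimal image-captioning sketch with LAVIS's unified loading API.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# One call fetches the model plus the matching image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # hypothetical local file
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

print(model.generate({"image": image}))  # e.g. ["a dog sitting on a couch"]
```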
Supported Models
| Model | Tasks | Year | Parameters |
|---|---|---|---|
| BLIP | Captioning, retrieval, VQA | 2022 | 470M |
| BLIP-2 | Captioning, VQA, retrieval | 2023 | 1.2B |
| InstructBLIP | Instruction-following VQA | 2023 | 1.2B |
| ALBEF | Retrieval, grounding | 2021 | 210M |
| ALPRO | Video-language tasks | 2022 | 250M |
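To see which of these architectures and checkpoints your installed version actually ships, LAVIS exposes a model zoo registry that can be printed directly. The example output in the comments is abridged and illustrative.

```python
# List every registered architecture and its available checkpoint types.
from lavis.models import model_zoo
print(model_zoo)
# Prints a table of architectures and types, e.g.:
#   blip_caption    base_coco, large_coco
#   blip2_t5        pretrain_flant5xl, ...
```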
Model Architecture
```mermaid
flowchart LR
    A[Image] --> B[Vision Encoder<br/>ViT]
    C[Text] --> D[Text Encoder<br/>BERT]
    B --> E[Cross-Modal Attention]
    D --> E
    E --> F{Fusion Strategy}
    F -->|BLIP| G[Multi-modal Encoder]
    F -->|BLIP-2| H[Q-Former]
    F -->|InstructBLIP| I[Q-Former + LLM]
    G --> J[Output]
    H --> J
    I --> J
```

Each model in LAVIS uses a different fusion strategy. BLIP uses a standard multi-modal encoder, BLIP-2 introduces the Q-Former (a lightweight transformer that bridges vision and text), and InstructBLIP adds a frozen LLM on top of the Q-Former for instruction following.
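To make the Q-Former idea concrete, here is a minimal, hypothetical PyTorch sketch of its core mechanism: a fixed set of learned query tokens cross-attends to frozen image features and distills them into a short sequence the language side can consume. Everything here (the class name, dimensions, a single cross-attention block) is illustrative; the real Q-Former is a full BERT-style transformer trained with multiple vision-language objectives.

```python
import torch
import torch.nn as nn

class MiniQFormer(nn.Module):
    """Illustrative sketch (not the LAVIS implementation): learned query
    tokens cross-attend to frozen image features, producing a fixed-length
    visual summary regardless of how many patch tokens the ViT emits."""

    def __init__(self, num_queries=32, dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, dim) from a frozen vision encoder
        q = self.queries.expand(image_feats.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, image_feats, image_feats)
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        return q  # (batch, num_queries, dim): compact visual tokens

# Hypothetical shapes: 257 patch tokens (16x16 grid + CLS) at dim 768
feats = torch.randn(2, 257, 768)
print(MiniQFormer()(feats).shape)  # torch.Size([2, 32, 768])
```

The design point this illustrates is why the bridge is cheap: only the queries and the small attention block are trained, while the vision encoder (and, in BLIP-2, the LLM) stays frozen.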
Task Performance
| Task | BLIP-2 | InstructBLIP | GPT-4V |
|---|---|---|---|
| VQAv2 accuracy | 65.0% | 73.2% | 75.5% |
| Image captioning (CIDEr) | 136.7 | 142.3 | 145.1 |
| Zero-shot retrieval | 62.3% | 67.8% | 70.2% |
| OKVQA accuracy | 52.4% | 57.3% | 61.8% |
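The VQA rows above come from prompting the models zero-shot. Below is a hedged sketch of what that looks like for BLIP-2 in LAVIS; the model name ("blip2_t5") and type ("pretrain_flant5xl") follow the model zoo naming, and the image path and question are illustrative.

```python
# Zero-shot visual question answering with BLIP-2 via a text prompt.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # hypothetical local file
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# The question is passed as a prompt alongside the image tensor.
answer = model.generate(
    {"image": image, "prompt": "Question: what is shown in the image? Answer:"}
)
print(answer)
```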
For more information, visit the LAVIS GitHub repository (https://github.com/salesforce/LAVIS) and the LAVIS documentation.
Frequently Asked Questions
Q: What GPU hardware is recommended for LAVIS? A: BLIP-2 and InstructBLIP require at least 16GB GPU memory. Smaller models like BLIP run on 8GB.
Q: Can I fine-tune models in LAVIS on custom data? A: Yes, LAVIS provides training scripts and configuration files for fine-tuning on custom datasets (see the annotation sketch after this FAQ).
Q: Does LAVIS support video input? A: Yes, through the ALPRO model which handles video-language understanding tasks.
Q: Is LAVIS compatible with PyTorch Lightning? A: Yes, LAVIS uses PyTorch and can integrate with Lightning for distributed training.
Q: What dataset formats does LAVIS support? A: COCO, Visual Genome, SBU Captions, and custom JSON/CSV formats through its data module.
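To make the fine-tuning and dataset-format answers above concrete: caption-style training data in LAVIS follows a COCO-like JSON list, and training runs are launched through train.py with a YAML config (see lavis/projects/ in the repository for templates). The field names below mirror the COCO caption annotations but should be checked against the dataset builder you register with; the file paths are hypothetical.

```python
# Write a minimal COCO-style caption annotation file for a custom dataset.
import json

annotations = [  # field names mirror the COCO caption format (verify per builder)
    {"image": "images/0001.jpg", "caption": "a red bicycle leaning on a wall", "image_id": "0001"},
    {"image": "images/0002.jpg", "caption": "two cats sleeping on a sofa", "image_id": "0002"},
]

with open("custom_train.json", "w") as f:
    json.dump(annotations, f)

# Then point a training config at this file and launch, e.g.:
#   python train.py --cfg-path path/to/your_caption_ft.yaml   (path hypothetical)
```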