Vision-language AI – models that understand both images and text – is one of the most rapidly advancing areas of artificial intelligence. Salesforce’s LAVIS (Library for Vision-Language Intelligence) provides a unified framework for training, evaluating, and deploying a wide range of vision-language models including BLIP, BLIP-2, InstructBLIP, and ALBEF.
LAVIS is designed for both researchers and practitioners. Researchers get clean implementations of state-of-the-art models with reproducible benchmarks, while practitioners get a streamlined API for applying these models to real-world tasks like image captioning, visual question answering, and cross-modal retrieval.
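As a taste of that streamlined API, the sketch below loads a BLIP captioning checkpoint and captions a local image. The model name ("blip_caption") and type ("base_coco") follow LAVIS's model zoo naming, but the image path and the printed caption are hypothetical; verify checkpoint names against the model zoo for your installed release.

```python
# Minimal image-captioning sketch with LAVIS's unified loading API.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# One call fetches the model plus the matching image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # hypothetical local file
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

print(model.generate({"image": image}))  # e.g. ["a dog sitting on a couch"]
```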
Supported Models
| Model | Tasks | Year | Parameters |
|---|---|---|---|
| BLIP | Captioning, retrieval, VQA | 2022 | 470M |
| BLIP-2 | Captioning, VQA, retrieval | 2023 | 1.2B |
| InstructBLIP | Instruction-following VQA | 2023 | 1.2B |
| ALBEF | Retrieval, grounding | 2021 | 210M |
| ALPRO | Video-language tasks | 2022 | 250M |
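To see which of these architectures and checkpoints your installed version actually ships, LAVIS exposes a model zoo registry that can be printed directly. The example output in the comments is abridged and illustrative.

```python
# List every registered architecture and its available checkpoint types.
from lavis.models import model_zoo
print(model_zoo)
# Prints a table of architectures and types, e.g.:
#   blip_caption    base_coco, large_coco
#   blip2_t5        pretrain_flant5xl, ...
```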
Model Architecture
```mermaid
flowchart LR
    A[Image] --> B[Vision Encoder<br/>ViT]
    C[Text] --> D[Text Encoder<br/>BERT]
    B --> E[Cross-Modal Attention]
    D --> E
    E --> F{Fusion Strategy}
    F -->|BLIP| G[Multi-modal Encoder]
    F -->|BLIP-2| H[Q-Former]
    F -->|InstructBLIP| I[Q-Former + LLM]
    G --> J[Output]
    H --> J
    I --> J
```

Each model in LAVIS uses a different fusion strategy. BLIP uses a standard multi-modal encoder, BLIP-2 introduces the Q-Former (a lightweight transformer that bridges vision and text), and InstructBLIP adds a frozen LLM on top of the Q-Former for instruction following.
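To make the Q-Former idea concrete, here is a minimal, hypothetical PyTorch sketch of its core mechanism: a fixed set of learned query tokens cross-attends to frozen image features and distills them into a short sequence the language side can consume. Everything here (the class name, dimensions, a single cross-attention block) is illustrative; the real Q-Former is a full BERT-style transformer trained with multiple vision-language objectives.

```python
import torch
import torch.nn as nn

class MiniQFormer(nn.Module):
    """Illustrative sketch (not the LAVIS implementation): learned query
    tokens cross-attend to frozen image features, producing a fixed-length
    visual summary regardless of how many patch tokens the ViT emits."""

    def __init__(self, num_queries=32, dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, dim) from a frozen vision encoder
        q = self.queries.expand(image_feats.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, image_feats, image_feats)
        q = self.norm1(q + attn_out)
        q = self.norm2(q + self.ffn(q))
        return q  # (batch, num_queries, dim): compact visual tokens

# Hypothetical shapes: 257 patch tokens (16x16 grid + CLS) at dim 768
feats = torch.randn(2, 257, 768)
print(MiniQFormer()(feats).shape)  # torch.Size([2, 32, 768])
```

The design point this illustrates is why the bridge is cheap: only the queries and the small attention block are trained, while the vision encoder (and, in BLIP-2, the LLM) stays frozen.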
Task Performance
| Task | BLIP-2 | InstructBLIP | GPT-4V |
|---|---|---|---|
| VQAv2 accuracy | 65.0% | 73.2% | 75.5% |
| Image captioning (CIDEr) | 136.7 | 142.3 | 145.1 |
| Zero-shot retrieval | 62.3% | 67.8% | 70.2% |
| OKVQA accuracy | 52.4% | 57.3% | 61.8% |
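The VQA rows above come from prompting the models zero-shot. Below is a hedged sketch of what that looks like for BLIP-2 in LAVIS; the model name ("blip2_t5") and type ("pretrain_flant5xl") follow the model zoo naming, and the image path and question are illustrative.

```python
# Zero-shot visual question answering with BLIP-2 via a text prompt.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # hypothetical local file
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# The question is passed as a prompt alongside the image tensor.
answer = model.generate(
    {"image": image, "prompt": "Question: what is shown in the image? Answer:"}
)
print(answer)
```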
For more information, visit the LAVIS GitHub repository (https://github.com/salesforce/LAVIS) and the LAVIS documentation.
Frequently Asked Questions
Q: What GPU hardware is recommended for LAVIS? A: BLIP-2 and InstructBLIP require at least 16GB GPU memory. Smaller models like BLIP run on 8GB.
Q: Can I fine-tune models in LAVIS on custom data? A: Yes, LAVIS provides training scripts and configuration files for fine-tuning on custom datasets (see the annotation sketch after this FAQ).
Q: Does LAVIS support video input? A: Yes, through the ALPRO model which handles video-language understanding tasks.
Q: Is LAVIS compatible with PyTorch Lightning? A: Yes, LAVIS uses PyTorch and can integrate with Lightning for distributed training.
Q: What dataset formats does LAVIS support? A: COCO, Visual Genome, SBU Captions, and custom JSON/CSV formats through its data module.
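To make the fine-tuning and dataset-format answers above concrete: caption-style training data in LAVIS follows a COCO-like JSON list, and training runs are launched through train.py with a YAML config (see lavis/projects/ in the repository for templates). The field names below mirror the COCO caption annotations but should be checked against the dataset builder you register with; the file paths are hypothetical.

```python
# Write a minimal COCO-style caption annotation file for a custom dataset.
import json

annotations = [  # field names mirror the COCO caption format (verify per builder)
    {"image": "images/0001.jpg", "caption": "a red bicycle leaning on a wall", "image_id": "0001"},
    {"image": "images/0002.jpg", "caption": "two cats sleeping on a sofa", "image_id": "0002"},
]

with open("custom_train.json", "w") as f:
    json.dump(annotations, f)

# Then point a training config at this file and launch, e.g.:
#   python train.py --cfg-path path/to/your_caption_ft.yaml   (path hypothetical)
```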