LAVIS: Salesforce's Library for Vision-Language AI

LAVIS is a deep learning library for vision-language research supporting BLIP, BLIP-2, InstructBLIP, and image-text retrieval, captioning, and QA tasks.

Vision-language AI – models that understand both images and text – is one of the most rapidly advancing areas of artificial intelligence. Salesforce’s LAVIS (Library for Vision-Language Intelligence) provides a unified framework for training, evaluating, and deploying a wide range of vision-language models including BLIP, BLIP-2, InstructBLIP, and ALBEF.

LAVIS is designed for both researchers and practitioners. Researchers get clean implementations of state-of-the-art models with reproducible benchmarks, while practitioners get a streamlined API for applying these models to real-world tasks like image captioning, visual question answering, and cross-modal retrieval.

Supported Models

| Model        | Tasks                       | Year | Parameters |
|--------------|-----------------------------|------|------------|
| BLIP         | Captioning, retrieval, VQA  | 2022 | 470M       |
| BLIP-2       | Captioning, VQA, retrieval  | 2023 | 1.2B       |
| InstructBLIP | Instruction-following VQA   | 2023 | 1.2B       |
| ALBEF        | Retrieval, grounding        | 2021 | 210M       |
| ALPRO        | Video-language tasks        | 2022 | 250M       |

Model Architecture

Each model in LAVIS uses a different fusion strategy. BLIP fuses the two modalities with a multimodal encoder trained end to end; BLIP-2 introduces the Q-Former, a lightweight transformer whose learned query tokens distill frozen image features into a fixed-size set of embeddings consumed by a frozen LLM; and InstructBLIP extends BLIP-2 by instruction-tuning the Q-Former, so the extracted visual features are conditioned on the user's instruction.
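The core Q-Former mechanism can be illustrated in a few lines: a fixed number of learned queries cross-attend over however many image patch embeddings the vision encoder produces, yielding a constant-size summary. This is a minimal single-head NumPy sketch, not LAVIS's implementation; all dimensions and weights here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qformer_step(image_feats, queries, Wq, Wk, Wv):
    """One cross-attention step: learned queries attend over frozen image features."""
    Q = queries @ Wq                                  # (num_queries, d)
    K = image_feats @ Wk                              # (num_patches, d)
    V = image_feats @ Wv                              # (num_patches, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (num_queries, num_patches)
    return attn @ V                                   # (num_queries, d)

rng = np.random.default_rng(0)
d, num_patches, num_queries = 64, 197, 32             # toy sizes, not BLIP-2's
image_feats = rng.standard_normal((num_patches, d))   # stand-in for ViT patch embeddings
queries = rng.standard_normal((num_queries, d))       # stand-in for learned query tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out = qformer_step(image_feats, queries, Wq, Wk, Wv)
print(out.shape)  # (32, 64): 32 query embeddings regardless of image resolution
```

The fixed output size is what makes the design cheap: the frozen LLM only ever sees 32 visual tokens, no matter how many patches the vision encoder emits.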

Task Performance

| Task                     | BLIP-2 | InstructBLIP | GPT-4V |
|--------------------------|--------|--------------|--------|
| VQAv2 accuracy           | 65.0%  | 73.2%        | 75.5%  |
| Image captioning (CIDEr) | 136.7  | 142.3        | 145.1  |
| Zero-shot retrieval      | 62.3%  | 67.8%        | 70.2%  |
| OKVQA accuracy           | 52.4%  | 57.3%        | 61.8%  |

For more information, visit the LAVIS GitHub repository and the LAVIS documentation.

Frequently Asked Questions

Q: What GPU hardware is recommended for LAVIS? A: BLIP-2 and InstructBLIP require at least 16GB GPU memory. Smaller models like BLIP run on 8GB.

Q: Can I fine-tune models in LAVIS on custom data? A: Yes, LAVIS provides training scripts and configuration files for fine-tuning on custom datasets.
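Fine-tuning in LAVIS is driven by YAML configs. The fragment below sketches the shape of such a config; the field names are modeled on the configs shipped under `lavis/projects/`, but treat them as illustrative and verify against the files in your LAVIS version.

```yaml
# Sketch of a LAVIS-style fine-tuning config (field names illustrative;
# compare with the configs under lavis/projects/ before use).
model:
  arch: blip_caption
  model_type: base_coco

datasets:
  coco_caption:            # replace with your registered custom dataset
    vis_processor:
      train:
        name: blip_image_train
    text_processor:
      train:
        name: blip_caption

run:
  task: captioning
  init_lr: 1e-5
  max_epoch: 5
  batch_size_train: 16
  output_dir: output/blip_caption_ft
```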

Q: Does LAVIS support video input? A: Yes, through the ALPRO model which handles video-language understanding tasks.

Q: Is LAVIS compatible with PyTorch Lightning? A: Yes, LAVIS uses PyTorch and can integrate with Lightning for distributed training.

Q: What dataset formats does LAVIS support? A: COCO, Visual Genome, SBU Captions, and custom JSON/CSV formats through its data module.
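For custom data, a COCO-caption-style JSON annotation file is the simplest route: a list of records, each pairing an image path with a caption and an id. The field names below follow the common LAVIS/COCO convention but should be checked against the dataset builder you register; the paths and captions are made up.

```python
import json
import os
import tempfile

# Hypothetical COCO-caption-style annotations for a custom captioning dataset.
annotations = [
    {"image": "images/0001.jpg", "caption": "a dog running on grass", "image_id": "0001"},
    {"image": "images/0002.jpg", "caption": "two cups on a table", "image_id": "0002"},
]

# Write and re-read the file the way a dataset loader would.
path = os.path.join(tempfile.mkdtemp(), "train_annotations.json")
with open(path, "w") as f:
    json.dump(annotations, f)

with open(path) as f:
    loaded = json.load(f)

print(len(loaded))  # 2
```

Each record is self-describing, so the same layout extends to VQA-style data by adding e.g. question/answer fields alongside the image path.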
