Overview
LLaVA (Large Language and Vision Assistant) is an open-source multimodal model that connects a CLIP vision encoder to a LLaMA/Vicuna language model, allowing image content to be reasoned over and described in natural language. This skill provides a toolkit for visual instruction tuning, visual question answering (VQA), and detailed image captioning within your AI workflows. It is particularly useful for developers building AI research agents, conversational image analysis tools, or document understanding systems that require both visual reasoning and natural language generation.
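
As a quick orientation, the sketch below shows what a single VQA call might look like when LLaVA is loaded through the Hugging Face `transformers` integration. The `llava-hf/llava-1.5-7b-hf` checkpoint, the sample image URL, and the USER/ASSISTANT prompt template are illustrative assumptions, not requirements of this skill.

```python
# Minimal VQA sketch using the Hugging Face LLaVA integration.
# Assumes: a transformers version with LLaVA support, a GPU, and the
# illustrative checkpoint "llava-hf/llava-1.5-7b-hf".
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # example checkpoint, swap as needed
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; this URL is a placeholder example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5 checkpoints expect the <image> token inside a USER/ASSISTANT prompt.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Replacing the question with an open-ended instruction such as "Describe this image in detail." turns the same call into detailed image captioning.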