Stats
913 Downloads
3 stars
Last updated
Updated on May 17
README
Qwen2.5-VL-3B-Instruct is a vision-language model that understands images, text, and video. It supports structured outputs and visual localization, and it can process long videos with temporal reasoning. The model is suited to object recognition, chart and layout analysis, and extraction of structured data from visual content.
This model is designed for practical vision-language applications, including document analysis, event detection in video, and agentic tool use. Outputs can include bounding boxes, points, and JSON-formatted structured data.
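As a rough illustration of consuming the JSON-formatted output mentioned above, the sketch below parses a reply containing a list of bounding boxes. The schema is an assumption: this page does not specify the output format, so the keys `bbox_2d` and `label`, the pixel-coordinate order `(x1, y1, x2, y2)`, and the optional markdown fence around the JSON are all illustrative. `Detection` and `parse_detections` are hypothetical helper names, not part of any library.

```python
import json
import re
from dataclasses import dataclass
from typing import List, Tuple

# Markdown code fence marker (three backticks), built programmatically.
FENCE = "`" * 3


@dataclass
class Detection:
    label: str
    box: Tuple[int, int, int, int]  # assumed order: (x1, y1, x2, y2)


def parse_detections(reply: str) -> List[Detection]:
    """Parse a model reply holding a JSON array of boxes (assumed schema)."""
    # If the JSON is wrapped in a markdown fence, keep only the fenced payload.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply

    detections = []
    for item in json.loads(payload):
        x1, y1, x2, y2 = item["bbox_2d"]  # assumed key name
        if x2 <= x1 or y2 <= y1:
            continue  # skip degenerate boxes with no area
        detections.append(Detection(item["label"], (x1, y1, x2, y2)))
    return detections


# Hand-written sample reply; not real model output.
sample_reply = (
    FENCE + "json\n"
    '[{"bbox_2d": [10, 20, 110, 220], "label": "cat"},'
    ' {"bbox_2d": [5, 5, 5, 9], "label": "bad"}]\n'
    + FENCE
)
detections = parse_detections(sample_reply)
print(detections)
```

Validating each box before use is a deliberate choice here: generated coordinates can be malformed, so downstream code should not assume every entry is a usable rectangle.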
Sources
The underlying model files this model uses
Based on