qwen2.5-vl-72b

Stats

1.1K Downloads

1 star

Capabilities

Vision Input

Minimum system memory

47GB

Qwen2.5-VL-72B-Instruct

Qwen2.5-VL-72B-Instruct is a vision-language model that accepts images, text, and video as input and supports structured outputs and visual localization. It can reason about timing in video and extract structured data from visual content such as charts and document layouts.

Intended uses include document analysis, event detection in video, and agentic tool use. Outputs can include bounding boxes, points, and JSON-formatted data.
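Since outputs can include bounding boxes in JSON form, downstream code usually needs to pull that JSON out of a free-text reply. The sketch below shows one way to do that in Python. The key names `bbox_2d` and `label`, the `Detection` type, and the `parse_detections` helper are assumptions made for illustration; this card does not specify an output schema, so adjust them to whatever your prompt asks the model to return.

```python
import json
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2), assumed pixel coordinates


def parse_detections(reply: str) -> list[Detection]:
    """Extract detections from a reply that contains one JSON array."""
    # Replies may wrap the JSON in prose or a code fence, so keep only
    # the text between the first '[' and the last ']'.
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return []

    detections = []
    for item in items:
        if not isinstance(item, dict):
            continue
        box = item.get("bbox_2d")  # assumed key name, not confirmed by this card
        # Skip entries that lack four numeric coordinates.
        if (
            isinstance(box, list)
            and len(box) == 4
            and all(isinstance(v, (int, float)) for v in box)
        ):
            detections.append(
                Detection(str(item.get("label", "")), tuple(int(v) for v in box))
            )
    return detections


# Hypothetical reply text, written by hand for this example.
sample_reply = (
    'Here are the objects: '
    '[{"bbox_2d": [10, 20, 110, 220], "label": "invoice total"}]'
)
print(parse_detections(sample_reply))
```

Returning an empty list on malformed input, rather than raising, keeps a batch job running when a single reply is not valid JSON; a stricter pipeline might prefer to log or raise instead.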

Sources

The underlying model files this model uses

Based on

lmstudio-community/Qwen2.5-VL-72B-Instruct-GGUF (Hugging Face)

GGUF