qwen2.5-vl-72b

Public

1.1K Downloads

1 star

Capabilities

Vision Input

Minimum system memory

47GB

Tags

72B
qwen2vl

Last updated

Updated on May 17by
lmmy's profile picture
lmmy

README

Qwen2.5-VL-72B-Instruct

Qwen2.5-VL-72B-Instruct is a vision-language model that processes images, text, and video, supporting structured outputs and visual localization. It is capable of temporal reasoning and can extract structured data from visual content, including charts and layouts.

Intended uses include document analysis, event detection in video, and agentic tool use. Outputs can include bounding boxes, points, and JSON-formatted data.

Sources

The underlying model files this model uses