Qwen3-VL-30B


The latest generation vision-language MoE model in the Qwen series with comprehensive upgrades to visual perception, spatial reasoning, and video understanding.

103.2K Downloads

15 stars

Capabilities

Vision Input

Minimum system memory

18GB

Tags

30B
qwen3_vl_moe

README

Qwen3-VL-30B

The latest generation vision-language MoE model in the Qwen series with comprehensive upgrades to visual perception, spatial reasoning, and video understanding.

Key Features

  • Visual Agent: Operates PC and mobile GUIs—recognizes elements, understands functions, and completes tasks
  • Visual Coding: Generates Draw.io, HTML, CSS, and JavaScript from images and videos
  • Advanced Spatial Perception: Provides 2D/3D grounding for spatial reasoning and embodied AI applications
  • Upgraded Recognition: Recognizes celebrities, anime, products, landmarks, flora, fauna, and more
  • Expanded OCR: Supports 32 languages with robust performance in low light, blur, and tilt conditions
  • Pure Text Performance: Text understanding on par with pure LLMs through seamless text-vision fusion
  • High-Efficiency MoE: 31.1B total parameters with only 3B activated (A3B) for excellent efficiency
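As a vision-input model, Qwen3-VL-30B accepts images alongside text. A minimal sketch of what a multimodal request might look like in the OpenAI-compatible chat format many local runtimes expose; the model identifier and image URL are illustrative placeholders, not confirmed values, and field names can differ between runtimes.

```python
# Hypothetical OpenAI-compatible chat request with one image and one text part.
# "qwen3-vl-30b" and the URL are placeholders, not confirmed identifiers.
request = {
    "model": "qwen3-vl-30b",
    "messages": [
        {
            "role": "user",
            "content": [
                # Image content part: the model "sees" this input
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/chart.png"}},
                # Text content part: the instruction about the image
                {"type": "text",
                 "text": "Describe what this chart shows."},
            ],
        }
    ],
}
```

The same message structure extends naturally to the visual-agent and visual-coding use cases above, e.g. sending a GUI screenshot with an instruction to locate an element.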

Architecture Highlights

  • 31.1B total parameters (3B activated per token)
  • Mixture-of-Experts architecture
  • Interleaved-MRoPE for enhanced video reasoning
  • DeepStack for fine-grained detail capture
  • Text-Timestamp Alignment for precise event localization
  • Context length: 256,000 tokens
  • Vision-enabled multimodal MoE model
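The efficiency of the A3B design comes from sparse routing: for each token, a router scores all experts but only the top-k expert FFNs actually run. A minimal, generic top-k gating sketch (the expert count and k here are illustrative, not Qwen3-VL's actual configuration):

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, k=2):
    """Pick the top-k experts for one token and renormalize their weights.

    Returns (expert_index, weight) pairs; only these experts' FFNs
    would be executed, which is why activated params << total params.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))

random.seed(0)
num_experts = 8                     # illustrative, not the real expert count
router_logits = [random.gauss(0, 1) for _ in range(num_experts)]
active = route(router_logits, k=2)  # only 2 of 8 experts run for this token
```

In the real model the token's output is the weight-blended sum of the selected experts' outputs; the skipped experts contribute no compute, so a 31.1B-parameter model only pays for roughly 3B parameters per token.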

Performance

Delivers strong vision-language performance across diverse tasks including document analysis, visual question answering, video understanding, and agentic interactions. The MoE architecture keeps inference cost low while maintaining high-quality outputs.

Parameters

This model ships with the following default sampling configuration:

  • Repeat Penalty: Disabled
  • Temperature: 0.7
  • Top K Sampling: 20
  • Top P Sampling: 0.8
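These defaults can also be passed per request. A sketch of an OpenAI-style request body carrying the values above; parameter names vary by runtime (some call it `repeat_penalty`, others `repetition_penalty` or `frequency_penalty`), and the model identifier is a placeholder.

```python
# Hypothetical request body applying this model card's default sampling values.
payload = {
    "model": "qwen3-vl-30b",  # placeholder identifier
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,       # moderate randomness
    "top_k": 20,              # sample from the 20 most likely tokens
    "top_p": 0.8,             # nucleus sampling cutoff
    # "Disabled" repeat penalty: in most runtimes 1.0 means no penalty.
    "repeat_penalty": 1.0,
}
```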