Multimodal Support

This document introduces the current multimodal support in the xLLM inference engine, including supported models, modality types, and offline and online interfaces.

Supported Models

Qwen2.5-VL: including 7B/32B/72B.
Qwen3-VL: including 2B/4B/8B/32B.
Qwen3-VL-MoE: including A3B/A22B.
MiniCPM-V-2_6: 7B.

Modality Types

Images: supports single-image and multi-image inputs, image + prompt combinations, and text-only prompts.