Graph Mode
Overview
Section titled “Overview”xLLM supports Graph Mode: computation graphs are pre-captured and replayed in subsequent runs to reduce CPU overhead and improve inference performance. Graph Mode has corresponding implementations on different hardware platforms.
Feature Description
Section titled “Feature Description”To optimize Host-side scheduling, graph mode submits a large task from the CPU once and then executes small kernels in a streaming manner on the device, significantly reducing startup time and device bubbles.
In the xLLM engine, Graph Mode provides the following:
Dynamic Shape Parameterization
Section titled “Dynamic Shape Parameterization”- Key dynamic dimensions other than num_tokens are treated as whole-graph input parameters, including batch_size, kv_seq_lens, q_seq_lens, block_table_size, and the like, for flexibility. During memory allocation and kernel configuration. At graph launch, the actual values of these parameters are passed so kernels use the correct strides to access data.
Piecewise Graph
Section titled “Piecewise Graph”- When some operators do not support graph capture and thus break the full graph, each segment (piece) after the break is captured as a separate graph. This maximizes graph-mode benefits even when the full graph cannot be captured, and is commonly used for prefill and chunked prefill.
Multi-Shape Reusable Memory Pool
Section titled “Multi-Shape Reusable Memory Pool”- To avoid waste from separate memory buffers (input, output, and intermediate tensors) per shape, we use an expandable memory pool. Multiple shapes share the pool base address, with different shapes using different offsets from that base.
These capabilities are implemented inside the xLLM engine and are generally controlled through gflags.
The minimal configuration only needs enable_graph to turn on Graph Mode for the decode phase:
--enable_graph=trueCommon companion flags include:
enable_graph: enables the base Graph Mode capability for the decode phaseenable_prefill_piecewise_graph: enables Piecewise Graph for the prefill phaseenable_graph_mode_decode_no_padding: builds decode graphs with the actualnum_tokensinstead of the padded shapemax_tokens_for_graph_mode: limits the maximum number of tokens covered by Graph Mode;0means no limit
If you want to enable both decode Graph and prefill Piecewise Graph, use:
--enable_graph=true \--enable_prefill_piecewise_graph=true \--max_tokens_for_graph_mode=2048If you need decode graph capture without padding, add:
--enable_graph=true \--enable_graph_mode_decode_no_padding=trueFor a more complete description of the flags, see CLI Reference.
Performance Impact
Section titled “Performance Impact”- With Graph Mode enabled, decode-phase throughput improves by about 8%–10% on models such as Qwen3-0.6B and Qwen3-1.7B.
Model Support
Section titled “Model Support”The following table lists each model’s support on ACLGraph, CudaGraph, and MLUGraph.
| Model | ACLGraph | CudaGraph | MLUGraph |
|---|---|---|---|
| Qwen3/Qwen3-MoE | ✅ | ✅ | ✅ |
| DeepseekV3.2 | ✅ | ✅ | |
| GLM4.5/4.6/4.7 | ✅ | ||
| Qwen2.5-VL | ✅ | ||
| Qwen3-VL/Qwen3-VL-MoE | ✅ | ||
| GLM4V | ✅ | ||
| GLM4V-MoE | ✅ |
Related Documentation
Section titled “Related Documentation”- For more detailed Graph Mode design and implementation notes, including ACL Graph / CUDA Graph fundamentals, dynamic dimension parameterization, Piecewise Graph, and multi-shape memory reuse, see: Graph Mode Design Document