MTP Speculative Inference
Background
Section titled “Background”MTP (Multi-Token Prediction) is an innovative inference acceleration technique that addresses efficiency bottlenecks in large language model generation. By incorporating specialized pre-training designs, MTP provides efficient draft token prediction capabilities during inference, significantly improving generation speed. Its core value lies in balancing inference efficiency with output quality, offering an optimal solution for long-sequence generation problems in LLMs, ultimately optimizing inference performance.
Key Features
Section titled “Key Features”MTP offers the following core acceleration capabilities:
-
Efficient Draft Generation: Uses a lightweight MTP architecture to rapidly generate draft tokens that serve as input for the main model’s verification, dramatically reducing computation overhead compared to traditional autoregressive generation.
-
Batch Verification Mechanism: The main model can simultaneously verify multiple MTP-generated draft tokens in batch, rather than processing them sequentially, significantly boosting inference speed.
-
High Sampling Accuracy: MTP solves the critical pain point of low token acceptance rates in post-training draft modules (like Eagle and Medusa). By optimizing draft generation during pre-training, MTP produces tokens with higher accuracy, reducing the verification burden on the main model.
-
Reduced Inference Latency: By pre-generating multiple potential subsequent tokens, MTP effectively decreases cumulative latency during long-text generation, creating a smoother user experience.
-
Optimized Resource Consumption: Compared to other inference acceleration techniques, MTP maintains acceleration effects while requiring fewer additional computational resources, making it suitable for deployment in resource-constrained environments.
MTP technology provides a novel efficiency optimization solution for LLM inference, particularly well-suited for real-time applications requiring rapid responses, representing an important direction in language model inference optimization.
Usage Example
Section titled “Usage Example”Export Model
Section titled “Export Model”The script will automatically detect the model type, or you can manually specify it.
DeepSeek-V3
Section titled “DeepSeek-V3”python3 tools/export_mtp.py \ --input-dir /path/to/DeepSeek-V3 \ --output-dir /path/to/DeepSeek-V3-mtpDeepSeek-V3.2
Section titled “DeepSeek-V3.2”python3 tools/export_mtp.py \ --input-dir /path/to/DeepSeek-V3.2 \ --output-dir /path/to/DeepSeek-V3.2-mtpDeepSeek-R1
Section titled “DeepSeek-R1”python3 tools/export_mtp.py \ --input-dir /path/to/DeepSeek-R1 \ --output-dir /path/to/DeepSeek-R1-mtpGLM4 MoE
Section titled “GLM4 MoE”python3 tools/export_mtp.py \ --input-dir /path/to/GLM-4.5-Air \ --output-dir /path/to/GLM-4.5-Air-mtpManually Specify Model Type
Section titled “Manually Specify Model Type”If auto-detection fails, you can manually specify the model type:
python3 tools/export_mtp.py \ --input-dir /path/to/model \ --output-dir /path/to/model-mtp \ --model-type deepseek_v3 # Options: deepseek_v3 (for V3/R1), deepseek_v32 (for V3.2), glm4_moeInput model references:
Launch Script
Section titled “Launch Script”When using MTP for inference, you need to specify both the main model and the draft model (MTP model).
DeepSeek-V3/V3.2/R1 Launch Example
Section titled “DeepSeek-V3/V3.2/R1 Launch Example”MODEL_PATH="/models/DeepSeek-V3"DRAFT_MODEL_PATH="/models/DeepSeek-V3-mtp"MASTER_NODE_ADDR="127.0.0.1:42123"START_PORT=13222START_DEVICE=0LOG_DIR="log"NNODES=16
for (( i=0; i<$NNODES; i++ ))do PORT=$((START_PORT + i)) DEVICE=$((START_DEVICE + i)) LOG_FILE="$LOG_DIR/node_$i.log" nohup ./xllm \ --model $MODEL_PATH \ --devices="npu:$DEVICE" \ --port $PORT \ --master_node_addr=$MASTER_NODE_ADDR \ --nnodes=$NNODES \ --draft_model $DRAFT_MODEL_PATH \ --draft_devices="npu:$DEVICE" \ --num_speculative_tokens 1 \ --max_memory_utilization=0.90 \ --max_tokens_per_batch=10000 \ --max_seqs_per_batch=256 \ --block_size=128 \ --ep_size=1 \ --dp_size=1 \ --enable_prefix_cache=false \ --enable_chunked_prefill=false \ --node_rank=$i > $LOG_FILE 2>&1 & sleep 0.5doneGLM4 MoE Launch Example
Section titled “GLM4 MoE Launch Example”MODEL_PATH="/models/GLM-4.5-Air"DRAFT_MODEL_PATH="/models/GLM-4.5-Air-mtp"# ... same other configurationsPerformance Data
Section titled “Performance Data”Based on ShareGPT dataset with input length=2500, output length=1500, total requests=80.
| method | Concurrency | Mean TPOT(ms) | Mean TTFT(ms) | Output Tokens/s | Total Tokens/s |
|---|---|---|---|---|---|
| baseline | 1 | 40.61 | 141.80 | 24.20 | 65.77 |
| mtp | 1 | 28.33 | 142.35 | 35.19 | 95.52 |
| baseline | 2 | 42.69 | 178.59 | 45.16 | 122.74 |
| mtp | 2 | 29.81 | 187.97 | 64.75 | 175.78 |
| baseline | 4 | 46.18 | 172.34 | 79.83 | 216.96 |
| mtp | 4 | 33.54 | 194.22 | 111.18 | 301.81 |
| baseline | 8 | 53.16 | 181.49 | 110.68 | 300.81 |
| mtp | 8 | 40.99 | 203.37 | 154.46 | 419.34 |
| baseline | 16 | 68.50 | 213.89 | 143.81 | 390.84 |
| mtp | 16 | 57.04 | 254.99 | 201.89 | 548.04 |
| baseline | 20 | 74.72 | 228.80 | 154.77 | 420.65 |
| mtp | 20 | 61.73 | 264.34 | 206.24 | 559.84 |
| baseline | 40 | 119.68 | 559.32 | 180.22 | 489.80 |
| mtp | 40 | 105.70 | 544.54 | 252.91 | 686.74 |
| baseline | 80 | 180.89 | 2996.21 | 192.09 | 522.06 |
| mtp | 80 | 152.19 | 2163.72 | 278.07 | 755.12 |