
MTP Speculative Inference

MTP (Multi-Token Prediction) is an inference acceleration technique for large language models. The MTP module is designed into the model at pre-training time, and during inference it acts as a draft model: it proposes draft tokens that the main model then verifies. This speeds up generation, especially for long sequences, while the main model's verification keeps output quality in check.

MTP offers the following core acceleration capabilities:

  • Efficient Draft Generation: A lightweight MTP module generates draft tokens quickly and passes them to the main model for verification, which costs less compute than producing every token with a separate autoregressive step of the main model.

  • Batch Verification Mechanism: The main model verifies multiple MTP-generated draft tokens in one batch rather than processing them sequentially, which increases inference speed.

  • High Sampling Accuracy: Draft modules added after training (such as Eagle and Medusa) can suffer from low token acceptance rates. Because MTP optimizes draft generation during pre-training, its draft tokens are more accurate, which reduces the verification work the main model spends on rejected drafts.

  • Reduced Inference Latency: By proposing several candidate tokens ahead of time, MTP lowers the cumulative latency of long-text generation and gives users a smoother experience.

  • Optimized Resource Consumption: Compared with other inference acceleration techniques, MTP keeps its speedup while requiring fewer additional compute resources, which makes it suitable for resource-constrained deployments.

MTP is particularly well suited to real-time applications that need fast responses.
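The draft-and-verify cycle described above can be sketched in a few lines of Python. This is a minimal illustration under greedy decoding, not xLLM's implementation: `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are toy functions over integer tokens, and the verification loop calls the target once per position for clarity, whereas a real engine would score all positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(target_next: NextTokenFn, draft_next: NextTokenFn,
                     context: List[Token], num_speculative_tokens: int = 1) -> List[Token]:
    """Run one draft-and-verify step and return the tokens it emits."""
    # 1. Draft: the cheap model proposes k tokens autoregressively.
    draft: List[Token] = []
    for _ in range(num_speculative_tokens):
        draft.append(draft_next(context + draft))

    # 2. Verify: compare each draft token with the target model's own choice.
    emitted: List[Token] = []
    for i in range(num_speculative_tokens + 1):
        expected = target_next(context + draft[:i])
        emitted.append(expected)  # always emit the target's token
        if i == num_speculative_tokens or draft[i] != expected:
            break  # all drafts accepted (bonus token) or first mismatch
    return emitted


def generate(target_next: NextTokenFn, draft_next: NextTokenFn,
             prompt: List[Token], max_new_tokens: int,
             num_speculative_tokens: int = 1) -> Tuple[List[Token], int]:
    """Generate tokens and count how many verification steps were needed."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target_next, draft_next, out, num_speculative_tokens))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


def toy_target(ctx: List[Token]) -> Token:
    """Toy main model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Toy draft model: agrees with the target except right after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, steps = generate(toy_target, toy_draft, prompt=[0], max_new_tokens=8)
print(tokens, steps)
```

Because every emitted token is the target model's own choice, the output matches plain greedy decoding with the target; the draft model only affects how many verification steps are needed.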

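The export script tools/export_mtp.py reads a model from --input-dir and writes the MTP draft model to --output-dir. It detects the model type automatically, or you can specify it manually.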

DeepSeek-V3:

python3 tools/export_mtp.py \
  --input-dir /path/to/DeepSeek-V3 \
  --output-dir /path/to/DeepSeek-V3-mtp

DeepSeek-V3.2:

python3 tools/export_mtp.py \
  --input-dir /path/to/DeepSeek-V3.2 \
  --output-dir /path/to/DeepSeek-V3.2-mtp

DeepSeek-R1:

python3 tools/export_mtp.py \
  --input-dir /path/to/DeepSeek-R1 \
  --output-dir /path/to/DeepSeek-R1-mtp

GLM-4.5-Air:

python3 tools/export_mtp.py \
  --input-dir /path/to/GLM-4.5-Air \
  --output-dir /path/to/GLM-4.5-Air-mtp

If auto-detection fails, you can manually specify the model type:

# Options for --model-type: deepseek_v3 (for V3/R1), deepseek_v32 (for V3.2), glm4_moe
python3 tools/export_mtp.py \
  --input-dir /path/to/model \
  --output-dir /path/to/model-mtp \
  --model-type deepseek_v3
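To decide which `--model-type` value to pass before running the export, one option is to read the checkpoint's metadata yourself. The sketch below assumes the input directory contains a Hugging Face-style `config.json` whose `model_type` field matches one of the option names listed above; both the helper name `guess_model_type` and that assumption are illustrative, and this is not a description of how `tools/export_mtp.py` performs its own detection.

```python
import json
from pathlib import Path
from typing import Optional

# Values accepted by --model-type, as listed in the command above.
SUPPORTED_MODEL_TYPES = {"deepseek_v3", "deepseek_v32", "glm4_moe"}


def guess_model_type(model_dir: str) -> Optional[str]:
    """Return a supported model type read from config.json, or None.

    Assumption: the checkpoint ships a config.json with a "model_type" key.
    """
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as f:
        model_type = json.load(f).get("model_type")
    return model_type if model_type in SUPPORTED_MODEL_TYPES else None
```

If the helper returns `None`, the model type has to be chosen by hand from the option list.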

Input model references:

When using MTP for inference, you need to specify both the main model and the draft model (MTP model).

# Paths to the main model and the exported MTP draft model
MODEL_PATH="/models/DeepSeek-V3"
DRAFT_MODEL_PATH="/models/DeepSeek-V3-mtp"
MASTER_NODE_ADDR="127.0.0.1:42123"
START_PORT=13222
START_DEVICE=0
LOG_DIR="log"
NNODES=16

# Create the log directory so the output redirection below does not fail
mkdir -p "$LOG_DIR"

# Launch one xllm process per rank; rank i uses port START_PORT+i and device START_DEVICE+i
for (( i=0; i<NNODES; i++ ))
do
  PORT=$((START_PORT + i))
  DEVICE=$((START_DEVICE + i))
  LOG_FILE="$LOG_DIR/node_$i.log"
  nohup ./xllm \
    --model "$MODEL_PATH" \
    --devices="npu:$DEVICE" \
    --port "$PORT" \
    --master_node_addr="$MASTER_NODE_ADDR" \
    --nnodes="$NNODES" \
    --draft_model "$DRAFT_MODEL_PATH" \
    --draft_devices="npu:$DEVICE" \
    --num_speculative_tokens 1 \
    --max_memory_utilization=0.90 \
    --max_tokens_per_batch=10000 \
    --max_seqs_per_batch=256 \
    --block_size=128 \
    --ep_size=1 \
    --dp_size=1 \
    --enable_prefix_cache=false \
    --enable_chunked_prefill=false \
    --node_rank="$i" > "$LOG_FILE" 2>&1 &
  sleep 0.5
done

For GLM-4.5-Air, change the model paths:

MODEL_PATH="/models/GLM-4.5-Air"
DRAFT_MODEL_PATH="/models/GLM-4.5-Air-mtp"
# ... same other configurations
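The `--num_speculative_tokens` setting bounds how much each verification step can gain: with 1 draft token, a step emits either 1 token (draft rejected) or 2 tokens (draft accepted). A rough way to reason about this is sketched below, under the simplifying assumption that each draft token is accepted independently with the same probability. The function name `expected_tokens_per_step` and the acceptance rates used are illustrative, not measurements.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int = 1) -> float:
    """Expected tokens emitted per verification step.

    Assumes each draft token is accepted independently with probability
    `acceptance_rate`; a step emits the accepted prefix plus one token
    chosen by the main model, i.e. sum over i = 0..k of acceptance_rate**i.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_speculative_tokens < 0:
        raise ValueError("num_speculative_tokens must be non-negative")
    return sum(acceptance_rate ** i for i in range(num_speculative_tokens + 1))


# Hypothetical acceptance rates, for illustration only.
for rate in (0.5, 0.8, 0.9):
    print(rate, expected_tokens_per_step(rate, num_speculative_tokens=1))
```

This counts tokens per main-model step only; the realized speedup is lower because the draft pass and the verification of extra positions also cost time.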

Benchmark results on the ShareGPT dataset with input length 2500, output length 1500, and 80 total requests:

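| Method   | Concurrency | Mean TPOT (ms) | Mean TTFT (ms) | Output Tokens/s | Total Tokens/s |
|----------|-------------|----------------|----------------|-----------------|----------------|
| baseline | 1           | 40.61          | 141.80         | 24.20           | 65.77          |
| mtp      | 1           | 28.33          | 142.35         | 35.19           | 95.52          |
| baseline | 2           | 42.69          | 178.59         | 45.16           | 122.74         |
| mtp      | 2           | 29.81          | 187.97         | 64.75           | 175.78         |
| baseline | 4           | 46.18          | 172.34         | 79.83           | 216.96         |
| mtp      | 4           | 33.54          | 194.22         | 111.18          | 301.81         |
| baseline | 8           | 53.16          | 181.49         | 110.68          | 300.81         |
| mtp      | 8           | 40.99          | 203.37         | 154.46          | 419.34         |
| baseline | 16          | 68.50          | 213.89         | 143.81          | 390.84         |
| mtp      | 16          | 57.04          | 254.99         | 201.89          | 548.04         |
| baseline | 20          | 74.72          | 228.80         | 154.77          | 420.65         |
| mtp      | 20          | 61.73          | 264.34         | 206.24          | 559.84         |
| baseline | 40          | 119.68         | 559.32         | 180.22          | 489.80         |
| mtp      | 40          | 105.70         | 544.54         | 252.91          | 686.74         |
| baseline | 80          | 180.89         | 2996.21        | 192.09          | 522.06         |
| mtp      | 80          | 152.19         | 2163.72        | 278.07          | 755.12         |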
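To make the comparison easier to read, the relative gains can be computed directly from the benchmark figures. The sketch below copies the mean TPOT and output throughput values from the results above and derives, for each concurrency level, the output throughput ratio (mtp over baseline) and the percentage reduction in TPOT. The variable and function names are illustrative.

```python
# concurrency -> (mean TPOT in ms, output tokens/s), copied from the benchmark results
BASELINE = {1: (40.61, 24.20), 2: (42.69, 45.16), 4: (46.18, 79.83), 8: (53.16, 110.68),
            16: (68.50, 143.81), 20: (74.72, 154.77), 40: (119.68, 180.22), 80: (180.89, 192.09)}
MTP = {1: (28.33, 35.19), 2: (29.81, 64.75), 4: (33.54, 111.18), 8: (40.99, 154.46),
       16: (57.04, 201.89), 20: (61.73, 206.24), 40: (105.70, 252.91), 80: (152.19, 278.07)}


def relative_gains(baseline: dict, mtp: dict) -> dict:
    """Return {concurrency: (throughput ratio, TPOT reduction in percent)}."""
    gains = {}
    for concurrency, (base_tpot, base_tps) in baseline.items():
        mtp_tpot, mtp_tps = mtp[concurrency]
        throughput_ratio = mtp_tps / base_tps
        tpot_reduction_pct = 100.0 * (base_tpot - mtp_tpot) / base_tpot
        gains[concurrency] = (throughput_ratio, tpot_reduction_pct)
    return gains


gains = relative_gains(BASELINE, MTP)
for concurrency, (ratio, reduction) in gains.items():
    print(f"concurrency={concurrency:>2}  output throughput x{ratio:.2f}  TPOT -{reduction:.1f}%")
```

In these figures MTP improves output throughput and mean TPOT at every concurrency level, while mean TTFT is slightly higher for MTP at concurrency 1 through 20 and lower at 40 and 80.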