MiniMax-M2.7
- Source code: https://github.com/jd-opensource/xllm
- Available in China: https://gitcode.com/xLLM-AI/xllm
- Weight download: modelscope-MiniMax-M2.7
- Offline dequantized weights: modelscope-Minimax2.7-BF16-xLLM
0. Weight Preparation
The original MiniMax-M2.7 weights are in FP8 format. xLLM supports the following three loading methods:
Method 1: Load FP8 weights directly (online dequantization)
Use the original FP8 weight path directly. xLLM dequantizes FP8 to BF16 during inference, so no additional preprocessing is required.
MODEL_PATH=/path/to/MiniMax-M2.7/
Method 2: Offline dequantization
Use the tool script to convert the FP8 weights to BF16 in advance, which avoids the extra overhead of online dequantization:
python tools/dequant_minimax_fp8.py --input-dir /path/to/MiniMax-M2.7/ --output-dir /path/to/MiniMax-M2.7-bf16/
Method 3: Download pre-converted BF16 weights
Download the dequantized BF16 weights directly:
git clone https://www.modelscope.cn/Eco-Tech/Minimax2.7-BF16-xLLM.git
1. Pull the Image Environment
First, download the image provided by xLLM:
# A3 arm
docker pull quay.io/jd_xllm/xllm-ai:xllm-dev-a3-arm-20260429
Then create the corresponding container:
sudo docker run -it --ipc=host -u 0 --privileged --name xllm_minimax --network=host \
  -v /var/queue_schedule:/var/queue_schedule \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
  -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
  -v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
  -v /var/log/npu/slog/:/var/log/npu/slog \
  -v ~/.ssh:/root/.ssh \
  -v /var/log/npu/profiling/:/var/log/npu/profiling \
  -v /var/log/npu/dump/:/var/log/npu/dump \
  -v /runtime/:/runtime/ -v /etc/hccn.conf:/etc/hccn.conf \
  -v /export/home:/export/home \
  -v /home/:/home/ \
  -w /export/home \
  quay.io/jd_xllm/xllm-ai:xllm-dev-a3-arm-20260429
2. Pull the Source Code and Build
Clone the official repository, check out the MiniMax branch, and fetch the submodules:
git clone https://github.com/jd-opensource/xllm
cd xllm
git checkout preview/minimax-minimal
git submodule init
git submodule update
Download and install dependencies:
pip install --upgrade pre-commit
yum install numactl
Run the build to generate the executable under build/:
python setup.py build
Build artifact path: build/xllm/core/server/xllm
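Before starting the model, it can help to confirm that the build produced the binary at the path above. The following is a minimal sketch and not part of xLLM: `check_artifact` is a hypothetical helper name, and it only tests that the file exists and is executable. It assumes it is run from the repository root.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of xLLM): returns 0 if the given path
# is an existing executable file, and 1 otherwise.
check_artifact() {
    local path="$1"
    if [ -f "$path" ] && [ -x "$path" ]; then
        echo "found: $path"
        return 0
    fi
    echo "missing or not executable: $path" >&2
    return 1
}

# Path taken from the build step; run this from the repository root.
check_artifact "build/xllm/core/server/xllm" || echo "re-run: python setup.py build"
```

If the check fails, inspecting the build output for the first error is usually more informative than re-running the build blindly.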
3. Start the Model
If the service is being started for the first time after the machine has rebooted, initialize the devices first
If this is skipped and the NPU has not been initialized, the xLLM process may fail to start.
python -c "import torch_npu
for i in range(16):
    torch_npu.npu.set_device(i)"
Environment Variables
##### 1. Configure dependency path environment variables
export PYTHON_INCLUDE_PATH="$(python3 -c 'from sysconfig import get_paths; print(get_paths()["include"])')"
export PYTHON_LIB_PATH="$(python3 -c 'from sysconfig import get_paths; print(get_paths()["include"])')"
export PYTORCH_NPU_INSTALL_PATH=/usr/local/libtorch_npu/
export PYTORCH_INSTALL_PATH="$(python3 -c 'import torch, os; print(os.path.dirname(os.path.abspath(torch.__file__)))')"
export LIBTORCH_ROOT="$(python3 -c 'import torch, os; print(os.path.dirname(os.path.abspath(torch.__file__)))')"

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/xllm/op_api/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/libtorch_npu/lib:$LD_LIBRARY_PATH
export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

##### 2. Configure log-related environment variables
rm -rf /root/atb/log/
rm -rf /root/ascend/log/
rm -rf core.*
export ASDOPS_LOG_LEVEL=ERROR
export ASDOPS_LOG_TO_STDOUT=1
export ASDOPS_LOG_TO_FILE=1

##### 3. Configure performance and communication-related environment variables
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export NPU_MEMORY_FRACTION=0.96
export ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=3
export ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1

export OMP_NUM_THREADS=12
export ALLOW_INTERNAL_FORMAT=1

export ATB_LAYER_INTERNAL_TENSOR_REUSE=1
export ATB_LLM_ENABLE_AUTO_TRANSPOSE=0
export ATB_CONVERT_NCHW_TO_ND=1
export ATB_LAUNCH_KERNEL_WITH_TILING=1
export ATB_OPERATION_EXECUTE_ASYNC=2
export ATB_CONTEXT_WORKSPACE_SIZE=0
export INF_NAN_MODE_ENABLE=1
export HCCL_EXEC_TIMEOUT=0
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_BASE_PORT=2864
Startup Command - MiniMax-M2.7 (single machine, 16 cards, TP=16)
# Maximum inference batch size
BATCH_SIZE=256
# Inference entry binary path (the build artifact from the previous step)
XLLM_PATH="build/xllm/core/server/xllm"
# Model path
MODEL_PATH=/path/to/MiniMax-M2.7/

MASTER_NODE_ADDR="10.143.3.204:10015"
LOCAL_HOST="10.143.3.204"
# Service port
START_PORT=18994
START_DEVICE=0
LOG_DIR="logs"
NNODES=16

# Create the log directory so the output redirection below does not fail
mkdir -p "$LOG_DIR"

for (( i=0; i<$NNODES; i++ ))
do
  PORT=$((START_PORT + i))
  DEVICE=$((START_DEVICE + i))
  LOG_FILE="$LOG_DIR/node_$i.log"
  nohup numactl -C $((i*40))-$((i*40+39)) $XLLM_PATH \
    --model $MODEL_PATH \
    --host $LOCAL_HOST \
    --port $PORT \
    --devices="npu:$DEVICE" \
    --master_node_addr=$MASTER_NODE_ADDR \
    --nnodes=$NNODES \
    --node_rank=$i \
    --max_memory_utilization=0.90 \
    --max_tokens_per_batch=8192 \
    --max_seqs_per_batch=$BATCH_SIZE \
    --communication_backend=hccl \
    --enable_chunked_prefill=false \
    --enable_prefix_cache=false \
    --enable_schedule_overlap=false \
    --enable_graph=false \
    --enable_atb_spec_kernel=false \
    > $LOG_FILE 2>&1 &
done
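The launch loop derives each rank's port, NPU index, and CPU binding from the loop counter. The sketch below reproduces that arithmetic so the layout can be inspected before launching anything. `rank_layout` is a hypothetical helper and not part of xLLM; the stride of 40 cores per rank is taken from the `numactl -C` expression in the startup command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of xLLM): print "rank port device cpu_first-cpu_last"
# for one rank, using the same arithmetic as the launch loop.
rank_layout() {
    local rank="$1" start_port="$2" start_device="$3" cores_per_rank="$4"
    local port=$((start_port + rank))
    local device=$((start_device + rank))
    local cpu_first=$((rank * cores_per_rank))
    local cpu_last=$((cpu_first + cores_per_rank - 1))
    echo "$rank $port $device $cpu_first-$cpu_last"
}

# Values from the startup script: START_PORT=18994, START_DEVICE=0, 40 cores per rank.
for rank in 0 1 15; do
    rank_layout "$rank" 18994 0 40
done
# prints:
# 0 18994 0 0-39
# 1 18995 1 40-79
# 15 19009 15 600-639
```

With 16 ranks the binding spans CPUs 0 through 639, so on a host with fewer logical CPUs the 40-core stride would presumably need to be reduced; comparing against the output of `nproc` before launching is a cheap sanity check.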