Skip to content
EN

MiniMax-M2.7

The original MiniMax-M2.7 weights are in FP8 format. xLLM supports the following three loading methods:

Method 1: Load FP8 weights directly (online dequantization)

Section titled “Method 1: Load FP8 weights directly (online dequantization)”

Use the original FP8 weight path directly. xLLM will dequantize FP8 to BF16 during inference, so no additional preprocessing is required.

Terminal window
MODEL_PATH=/path/to/MiniMax-M2.7/

Use the tool script to convert FP8 weights to BF16 in advance to avoid the extra overhead of online dequantization:

Terminal window
python tools/dequant_minimax_fp8.py --input-dir /path/to/MiniMax-M2.7/ --output-dir /path/to/MiniMax-M2.7-bf16/

Method 3: Download pre-converted BF16 weights

Section titled “Method 3: Download pre-converted BF16 weights”

Download the dequantized BF16 weights directly:

Terminal window
git clone https://www.modelscope.cn/Eco-Tech/Minimax2.7-BF16-xLLM.git

First, download the image provided by xLLM:

Terminal window
# A3 arm
docker pull quay.io/jd_xllm/xllm-ai:xllm-dev-a3-arm-20260429

Then create the corresponding container:

Terminal window
sudo docker run -it --ipc=host -u 0 --privileged --name xllm_minimax --network=host \
-v /var/queue_schedule:/var/queue_schedule \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v ~/.ssh:/root/.ssh \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /runtime/:/runtime/ -v /etc/hccn.conf:/etc/hccn.conf \
-v /export/home:/export/home \
-v /home/:/home/ \
-w /export/home \
quay.io/jd_xllm/xllm-ai:xllm-dev-a3-arm-20260429

Download the official repository and module dependencies:

Terminal window
git clone https://github.com/jd-opensource/xllm
cd xllm
git checkout preview/minimax-minimal
git submodule init
git submodule update

Download and install dependencies:

Terminal window
pip install --upgrade pre-commit
yum install numactl

Run the build to generate the executable under build/:

Terminal window
python setup.py build

Build artifact path: build/xllm/core/server/xllm

If the service is being started for the first time after the machine has rebooted, initialize the devices first

Section titled “If the service is being started for the first time after the machine has rebooted, initialize the devices first”

If this is skipped and the NPU has not been initialized, the xLLM process may fail to start.

Terminal window
python -c "import torch_npu
for i in range(16):torch_npu.npu.set_device(i)"
Terminal window
##### 1. Configure dependency path environment variables
export PYTHON_INCLUDE_PATH="$(python3 -c 'from sysconfig import get_paths; print(get_paths()["include"])')"
export PYTHON_LIB_PATH="$(python3 -c 'from sysconfig import get_paths; print(get_paths()["include"])')"
export PYTORCH_NPU_INSTALL_PATH=/usr/local/libtorch_npu/
export PYTORCH_INSTALL_PATH="$(python3 -c 'import torch, os; print(os.path.dirname(os.path.abspath(torch.__file__)))')"
export LIBTORCH_ROOT="$(python3 -c 'import torch, os; print(os.path.dirname(os.path.abspath(torch.__file__)))')"
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/xllm/op_api/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/libtorch_npu/lib:$LD_LIBRARY_PATH
export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
##### 2. Configure log-related environment variables
rm -rf /root/atb/log/
rm -rf /root/ascend/log/
rm -rf core.*
export ASDOPS_LOG_LEVEL=ERROR
export ASDOPS_LOG_TO_STDOUT=1
export ASDOPS_LOG_TO_FILE=1
##### 3. Configure performance and communication-related environment variables
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export NPU_MEMORY_FRACTION=0.96
export ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=3
export ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1
export OMP_NUM_THREADS=12
export ALLOW_INTERNAL_FORMAT=1
export ATB_LAYER_INTERNAL_TENSOR_REUSE=1
export ATB_LLM_ENABLE_AUTO_TRANSPOSE=0
export ATB_CONVERT_NCHW_TO_ND=1
export ATB_LAUNCH_KERNEL_WITH_TILING=1
export ATB_OPERATION_EXECUTE_ASYNC=2
export ATB_CONTEXT_WORKSPACE_SIZE=0
export INF_NAN_MODE_ENABLE=1
export HCCL_EXEC_TIMEOUT=0
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_BASE_PORT=2864

Startup Command - MiniMax-M2.7 (single machine, 16 cards, TP=16)

Section titled “Startup Command - MiniMax-M2.7 (single machine, 16 cards, TP=16)”
Terminal window
BATCH_SIZE=256
# Maximum inference batch size
XLLM_PATH="build/xllm/core/server/xllm"
# Inference entry binary path, which is the build artifact from the previous step
MODEL_PATH=/path/to/MiniMax-M2.7/
# Model path
MASTER_NODE_ADDR="10.143.3.204:10015"
LOCAL_HOST="10.143.3.204"
# Service port
START_PORT=18994
START_DEVICE=0
LOG_DIR="logs"
NNODES=16
for (( i=0; i<$NNODES; i++ ))
do
PORT=$((START_PORT + i))
DEVICE=$((START_DEVICE + i))
LOG_FILE="$LOG_DIR/node_$i.log"
nohup numactl -C $((i*40))-$((i*40+39)) $XLLM_PATH \
--model $MODEL_PATH \
--host $LOCAL_HOST \
--port $PORT \
--devices="npu:$DEVICE" \
--master_node_addr=$MASTER_NODE_ADDR \
--nnodes=$NNODES \
--node_rank=$i \
--max_memory_utilization=0.90 \
--max_tokens_per_batch=8192 \
--max_seqs_per_batch=$BATCH_SIZE \
--communication_backend=hccl \
--enable_chunked_prefill=false \
--enable_prefix_cache=false \
--enable_schedule_overlap=false \
--enable_graph=false \
--enable_atb_spec_kernel=false \
> $LOG_FILE 2>&1 &
done