Multi-Node Deployment
This example demonstrates how to launch a 32-GPU (NPU) deployment across 2 machines. Launching Services on the First Machine:
bash start_deepseek_machine_1.shThe start_deepseek_machine_1.sh script is as follows:
#!/bin/bashset -e
rm -rf core.*
source /usr/local/Ascend/ascend-toolkit/set_env.shsource /usr/local/Ascend/nnal/atb/set_env.shexport HCCL_IF_BASE_PORT=43432 # HCCL communication base port
# 4. Start distributed serviceMODEL_PATH="/path/to/your/DeepSeek-R1" # Model pathMASTER_NODE_ADDR="123.123.123.123:9748" # Master node address (must be globally consistent)LOCAL_HOST=123.123.123.123 # Local IP for service launchSTART_PORT=18000 # Service starting portSTART_DEVICE=0 # Starting NPU logical device numberLOCAL_NODES=16 # Number of local processes (this script launches 16 processes)LOG_DIR="log" # Log directoryNNODES=32 # Total number of GPUs/NPUs (32 in this 2-machine example)
mkdir -p $LOG_DIR
for (( i=0; i<$LOCAL_NODES; i++ ))do PORT=$((START_PORT + i)) DEVICE=$((START_DEVICE + i)) LOG_FILE="$LOG_DIR/node_$i.log" /path/to/xllm \ --model $MODEL_PATH \ --host $LOCAL_HOST \ --port $PORT \ --devices="npu:$DEVICE" \ --master_node_addr=$MASTER_NODE_ADDR \ --nnodes=$NNODES \ --max_memory_utilization=0.86 \ --max_tokens_per_batch=40000 \ --max_seqs_per_batch=256 \ --block_size=128 \ --enable_prefix_cache=false \ --enable_chunked_prefill=false \ --communication_backend="hccl" \ --enable_schedule_overlap=true \ --rank_tablefile=./ranktable_2s_32p.json \ --node_rank=$i \ > $LOG_FILE 2>&1 &doneLaunching Services on the Second Machine:
bash start_deepseek_machine_2.shThe start_deepseek_machine_2.sh script is as follows:
#!/bin/bashset -e
rm -rf core.*
source /usr/local/Ascend/ascend-toolkit/set_env.shsource /usr/local/Ascend/nnal/atb/set_env.shexport HCCL_IF_BASE_PORT=43432 # HCCL communication base port
MODEL_PATH="/path/to/your/DeepSeek-R1" # Model pathMASTER_NODE_ADDR="123.123.123.123:9748" # Master node address (must be globally consistent)LOCAL_HOST=456.456.456.456 # Local IP for service launchSTART_PORT=18000 # Service starting portSTART_DEVICE=0 # Starting NPU logical device numberLOCAL_NODES=16 # Number of local processes (this script launches 16 processes)LOG_DIR="log" # Log directoryNNODES=32 # Total number of GPUs/NPUs (32 in this 2-machine example)
mkdir -p $LOG_DIR
for (( i=0; i<$LOCAL_NODES; i++ ))do PORT=$((START_PORT + i)) DEVICE=$((START_DEVICE + i)) LOG_FILE="$LOG_DIR/node_$i.log" /path/to/xllm \ --model $MODEL_PATH \ --host $LOCAL_HOST \ --port $PORT \ --devices="npu:$DEVICE" \ --master_node_addr=$MASTER_NODE_ADDR \ --nnodes=$NNODES \ --max_memory_utilization=0.86 \ --max_tokens_per_batch=40000 \ --max_seqs_per_batch=256 \ --block_size=128 \ --enable_prefix_cache=false \ --enable_chunked_prefill=false \ --communication_backend="hccl" \ --enable_schedule_overlap=true \ --rank_tablefile=./ranktable_2s_32p.json \ --node_rank=$((i + LOCAL_NODES)) \ > $LOG_FILE 2>&1 &doneThis example uses 2 machines. You can set the total number of GPUs/NPUs via --nnodes, where --node_rank specifies the global rank ID for each node.
The --rank_tablefile=./ranktable_2s_32p.jsonparameter points to the configuration file required for establishing the NPU communication domain. For instructions on generating this file, refer to Ranktable Generation.