NVIDIA GPU
Use the CUDA backend when running xLLM on NVIDIA GPUs.
Image and Container Startup
Section titled “Image and Container Startup”Pull the CUDA development image:
docker pull quay.io/jd_xllm/xllm-ai:xllm-dev-cuda-x86Start the container:
sudo docker run -it \--privileged \--shm-size '128gb' \--ipc=host \--net=host \--pid=host \--name=xllm-cuda \-v $HOME:$HOME \-w $HOME \<docker_image_name> \/bin/bashServer Startup Command
Section titled “Server Startup Command”Set visible CUDA devices before launching the service:
#!/bin/bashset -e
rm -rf core.*
export CUDA_VISIBLE_DEVICES=0# for debug# export CUDA_LAUNCH_BLOCKING=1
MODEL_PATH="/path/to/model/Qwen3-8B"MASTER_NODE_ADDR="127.0.0.1:9748"START_PORT=18000START_DEVICE=0LOG_DIR="log"NNODES=1
mkdir -p $LOG_DIR
for (( i=0; i<$NNODES; i++ ))do PORT=$((START_PORT + i)) DEVICE=$((START_DEVICE + i)) LOG_FILE="$LOG_DIR/node_$i.log" xllm \ --model $MODEL_PATH \ --devices="cuda:$DEVICE" \ --port $PORT \ --master_node_addr=$MASTER_NODE_ADDR \ --nnodes=$NNODES \ --block_size=32 \ --max_memory_utilization=0.8 \ --enable_prefix_cache=false \ --enable_chunked_prefill=false \ --enable_schedule_overlap=true \ --node_rank=$i \ > $LOG_FILE 2>&1 &doneFor a single-device run, <device-id> usually starts from 0. For multi-device or multi-node runs, keep the device ids, --node_rank, --nnodes, and service ports consistent with the launch topology.
- The CUDA development image and Dockerfile references are maintained in Quick Start.
- The default CUDA launch example uses
--block_size=32and--max_memory_utilization=0.8; adjust these values according to the target model and available GPU memory. - CUDA timeline collection is available through the profiling endpoints described in Online Profiling.