xLLM uses gflags to manage service startup parameters. --model <PATH> is the only required flag. When --config_json_file is used, values in the JSON file override command-line flag values. The tables below are grouped by the Config classes in /xllm/core/framework/config, with one Config per section. The ConfigJsonUtils section contains the common JSON config-file flags.
| Parameter | Type | Default | Description |
|---|
config_json_file | string | "" | Path to a JSON config file. Values in the file override command-line flag values. |
enable_dump_config_json | bool | false | Whether to dump the resolved startup config as JSON. |
dump_config_json_file | string | "xllm_config.json" | Path to write the resolved startup config as JSON. Used only when enable_dump_config_json=true. |
| Parameter | Type | Default | Description |
|---|
host | string | "" | Host name or IP for the bRPC server. |
port | int32 | 8010 | Port for the bRPC server. |
rpc_idle_timeout_s | int32 | -1 | Close the connection when there are no read/write operations during the last rpc_idle_timeout_s seconds. -1 waits indefinitely. |
rpc_channel_timeout_ms | int32 | -1 | Maximum bRPC Channel duration in milliseconds. -1 waits indefinitely. |
max_reconnect_count | int32 | 40 | Maximum number of reconnect attempts from a worker to a server. |
num_threads | int32 | 8 | Number of threads used to process requests. |
max_concurrent_requests | int32 | 200 | Maximum number of concurrent requests the xLLM instance can handle. Set to 0 for no limit. |
num_request_handling_threads | int32 | 4 | Number of threads for handling input requests. |
num_response_handling_threads | int32 | 4 | Number of threads for handling responses. |
health_check_interval_ms | int32 | 3000 | Worker health-check interval in milliseconds. |
| Parameter | Type | Default | Description |
|---|
model_id | string | "" | Hugging Face model name, not a path. |
model | string | "" | Hugging Face model name or model path. |
backend | string | "" | Backend model type: llm for text-only models, vlm for multimodal models, or dit for diffusion models. |
task | string | "generate" | Model task, for example generate, embed, or mm_embed. |
devices | string | "npu:0" | Devices used by the current process, for example npu:0 or npu:0,npu:1. |
limit_image_per_prompt | int32 | 4 | Maximum number of images per prompt. Only applies to multimodal models. |
reasoning_parser | string | "" | Reasoning parser, for example auto, glm45, glm47, glm5, qwen3, qwen35, or deepseek-r1. |
tool_call_parser | string | "" | Tool-call parser, for example auto, qwen25, qwen3, qwen35, qwen3_coder, kimi_k2, deepseekv3, glm45, glm47, or glm5. |
enable_qwen3_reranker | bool | false | Whether to enable the Qwen3 reranker. |
enable_return_mm_full_embeddings | bool | false | Whether VLM models return ViT embeddings and sequence embeddings. |
flashinfer_workspace_buffer_size | int32 | 134217728 | Reserved FlashInfer workspace buffer for intermediate attention results in split-k attention. Default is 128 MiB. |
use_audio_in_video | bool | false | Whether to decode both audio and video when the input is a video. |
use_cpp_chat_template | bool | true | Use native C++ chat templates for supported models, for example deepseek_v32. Set to false to fall back to Jinja for debugging. |
| Parameter | Type | Default | Description |
|---|
enable_manual_loader | bool | false | Pin decoder layer weights to host memory and use async H2D transfer. Required by enable_rolling_load; also implied by enable_xtensor. |
enable_rolling_load | bool | false | Enable rolling weight load: keep only N decoder layer weight slots in HBM and stream-load each layer just in time. Requires enable_manual_loader=true. NPU only. |
rolling_load_num_cached_layers | int32 | 2 | Number of decoder layer weight slots to keep in HBM when enable_rolling_load=true. |
rolling_load_num_rolling_slots | int32 | -1 | Number of rolling slots used by decoder rolling load. Fixed slots are rolling_load_num_cached_layers - rolling_load_num_rolling_slots. -1 means auto, min(2, preload_count). Must be in [-1, rolling_load_num_cached_layers]. |
enable_prefetch_weight | bool | false | Whether to enable weight prefetching. Only applies to Qwen3-dense models. The default gateup weight prefetch ratio is 40%; adjust with PREFETCH_COEFFOCIENT. |
| Parameter | Type | Default | Description |
|---|
block_size | int32 | 128 | Number of slots per KV Cache block. |
max_cache_size | int64 | 0 | Maximum GPU memory size for KV Cache. 0 means calculated from available memory. |
max_memory_utilization | double | 0.8 | Fraction of GPU memory used for model inference, including model weights and KV Cache. |
kv_cache_dtype | string | "auto" | KV Cache dtype for quantization. auto aligns with model dtype and disables quantization. int8 enables INT8 quantization and is only supported on the MLU backend. |
enable_prefix_cache | bool | true | Whether to enable prefix cache in the block manager. See Prefix Cache. |
xxh3_128bits_seed | uint32 | 1024 | Default XXH3 128-bit hash seed. |
enable_xtensor | bool | false | Whether to enable XTensor for model weights with the physical page pool. |
phy_page_granularity_size | int64 | 2097152 | Granularity size of one physical page in bytes, default 2 MiB, for continuous KV Cache. |
| Parameter | Type | Default | Description |
|---|
prefetch_timeout | uint32 | 0 | Timeout for prefetching from KV Cache Store. |
prefetch_batch_size | uint32 | 2 | Copy batch size for prefetching from KV Cache Store. |
layers_wise_copy_batchs | uint32 | 4 | Number of batches for layer-wise H2D copy. |
host_blocks_factor | double | 0.0 | Host block factor, for example host block num = host_blocks_factor * hbm block num. |
enable_kvcache_store | bool | false | Whether to enable KV Cache Store. |
enable_cache_upload | bool | false | Whether to upload cache information to the service. This is only available when service routing is enabled. |
store_protocol | string | "tcp" | KV Cache Store protocol, for example tcp or rdma. |
store_master_server_address | string | "" | Address information of the Store master service. |
store_metadata_server | string | "" | Address of the KV Cache Store metadata service. |
store_local_hostname | string | "" | Local host name of the KV Cache Store client. |
enable_control_h2d_block_num | bool | false | Whether to control the number of H2D copy blocks. |
| Parameter | Type | Default | Description |
|---|
enable_beam_search_kernel | bool | false | Whether to enable the beam search kernel. |
beam_width | int32 | 1 | Beam width for beam search. |
enable_block_copy_kernel | bool | true (NPU/CUDA); false (other backends) | Whether to use the block copy kernel on supported backends. |
enable_topk_sorted | bool | true | Whether to enable sorted top-k output. |
| Parameter | Type | Default | Description |
|---|
max_tokens_per_batch | int32 | 10240 | Maximum number of tokens per batch. |
max_seqs_per_batch | int32 | 1024 | Maximum number of sequences per batch. |
enable_schedule_overlap | bool | false | Whether to enable schedule overlap, also known as asynchronous scheduling. See Async Scheduling. |
prefill_scheduling_memory_usage_threshold | double | 0.95 | Memory usage threshold during prefill scheduling. |
enable_chunked_prefill | bool | true | Whether to enable chunked prefill. |
max_tokens_per_chunk_for_prefill | int32 | -1 | Maximum number of tokens per chunk in the prefill stage. -1 uses the default policy. |
chunked_match_frequency | int32 | 2 | Sequence prefix-cache match frequency. |
use_zero_evict | bool | false | Whether to use ZeroEvictionScheduler. See Zero Evict Scheduler. |
max_decode_token_per_sequence | int32 | 256 | Maximum decode tokens per sequence for ZeroEvictionScheduler. |
priority_strategy | string | "fcfs" | Request priority strategy, for example fcfs, priority, or deadline. |
use_mix_scheduler | bool | false | Whether to use MixScheduler to handle prefill and decode uniformly. |
enable_online_preempt_offline | bool | true | Whether online requests can preempt offline requests. |
aggressive_coeff | double | 1.0 | Aggressive coefficient for MixScheduler urgency judgment. |
starve_threshold | double | 1.0 | Starvation threshold coefficient for MixScheduler. |
enable_starve_prevent | bool | true | Whether to enable anti-starvation behavior in MixScheduler. |
| Parameter | Type | Default | Description |
|---|
dp_size | int32 | 1 | Data parallel size for MLA attention. |
ep_size | int32 | 1 | Expert parallel size for MoE models. |
cp_size | int32 | 1 | Context parallel size for DSA attention. |
kv_split_size | int32 | 0 | KV Cache split width. 0 falls back to cp_size; 1 disables KV split; other K values that divide cp_size shard KV across K ranks. |
prefill_kv_split_size | int32 | 0 | KV Cache split width of the remote prefill instance. Decode nodes use it in PD mode to match the prefill logical block layout. 0 falls back to local cp_size. |
tp_size | int64 | 1 | Tensor parallelism size. Only used for DiT models. |
sp_size | int64 | 1 | Sequence parallelism size. Only used for DiT models. |
cfg_size | int64 | 1 | Classifier-free guidance parallelism size. Only used for DiT models. |
communication_backend | string | "hccl" | NPU communication backend, for example lccl or hccl. Uses hccl when DP is enabled. |
enable_prefill_sp | bool | false | Whether to enable prefill-only sequence parallel. Supports enable_chunked_prefill=true, but only for prefill-only batches (PREFILL / CHUNKED_PREFILL). |
enable_multi_stream_parallel | bool | false | Whether to enable computation/communication parallelism with two streams and two micro batches in the prefill stage. See Multi-Stream Parallelism. |
micro_batch_num | int32 | 1 | Number of micro batches used for multi-stream parallelism. |
enable_dp_balance | bool | false | Whether to enable DP load balancing. If true, sequences within a single DP batch are shuffled. |
| Parameter | Type | Default | Description |
|---|
enable_eplb | bool | false | Whether to enable expert parallel load balance. See EPLB. |
redundant_experts_num | int32 | 1 | Number of redundant experts per device. |
eplb_update_interval | int64 | 1000 | EPLB update interval. |
eplb_update_threshold | double | 0.8 | EPLB update threshold. |
expert_parallel_degree | int32 | 0 | Expert parallel degree. |
rank_tablefile | string | "" | ATB HCCL rank table file. |
| Parameter | Type | Default | Description |
|---|
master_node_addr | string | "127.0.0.1:19888" | Master address for multi-node distributed serving, for example 10.18.1.1:9999. |
xtensor_master_node_addr | string | "127.0.0.1:19889" | Master address for the XTensor distributed service, for example 10.18.1.1:9999. |
nnodes | int32 | 1 | Number of multi-node nodes. |
node_rank | int32 | 0 | Rank of the current node. |
device_ip | string | "" | Device IP address for KV Cache transfer. |
etcd_addr | string | "" | etcd address used to save instance metadata. |
etcd_namespace | string | "" | Optional etcd namespace prefix for all xLLM keys, for example prod-a. |
enable_service_routing | bool | false | Whether to enable xLLM service routing. |
heart_beat_interval | double | 0.5 | Heartbeat interval. |
etcd_ttl | int32 | 3 | Time to live for etcd keys. |
| Parameter | Type | Default | Description |
|---|
enable_disagg_pd | bool | false | Whether to enable disaggregated prefill and decode execution. See P-D Separation. |
enable_pd_ooc | bool | false | Whether to enable online-offline co-location in disaggregated PD mode. |
disagg_pd_port | int32 | 7777 | Listening port for the disaggregated PD bRPC server. |
instance_role | string | "DEFAULT" | Instance role, for example DEFAULT, PREFILL, DECODE, or MIX. |
kv_cache_transfer_type | string | "LlmDataDist" | KV Cache transfer type, for example LlmDataDist, Mooncake, or HCCL. |
kv_cache_transfer_mode | string | "PUSH" | KV Cache transfer mode, for example PUSH or PULL. |
transfer_listen_port | int32 | 26000 | Listening port for KV Cache Transfer. |
kv_push_dst_rotate | bool | true | Rotate the destination-worker traversal order in push_kv_blocks per KV-split rank to spread traffic across decode workers. |
kv_push_timing_log | bool | false | Whether to emit per-step and per-call timing logs for push_kv_blocks. |
| Parameter | Type | Default | Description |
|---|
draft_model | string | "" | Draft model path. See MTP for MTP usage. |
draft_devices | string | "npu:0" | Devices used by the draft model, for example npu:0 or npu:0,npu:1. |
num_speculative_tokens | int32 | 0 | Number of speculative tokens generated per speculative decoding step. |
speculative_algorithm | string | "MTP" | Speculative decoding algorithm. Supported values: MTP, Eagle3, Suffix. |
speculative_suffix_cache_max_depth | int32 | 64 | Maximum suffix-tree depth for suffix speculative decoding. |
speculative_suffix_max_spec_factor | double | 1.0 | Maximum suffix speculation token factor relative to match length. |
speculative_suffix_max_spec_offset | double | 0.0 | Maximum additive token offset for suffix speculation. |
speculative_suffix_min_token_prob | double | 0.1 | Minimum token probability used in suffix speculation. |
speculative_suffix_max_cached_requests | int32 | -1 | Maximum number of globally cached requests for suffix speculation. -1 means unlimited; 0 disables it. |
speculative_suffix_use_tree_spec | bool | false | Whether to use tree-based suffix speculation instead of path speculation. |
enable_opt_validate_probs | bool | false | Whether validation uses selected-only draft_probs [B,S] directly. If false, selected-only cache values are restored to dense [B,S,V]. |
enable_atb_spec_kernel | bool | false | Whether to use the ATB speculative kernel. |
| Parameter | Type | Default | Description |
|---|
enable_profile_step_time | bool | false | Whether to enable step-time profiling. |
enable_profile_token_budget | bool | false | Whether to enable token-budget profiling. |
enable_latency_aware_schedule | bool | false | Whether to use predicted latency for latency-aware scheduling. |
profile_max_prompt_length | int32 | 2048 | Maximum prompt length used for profiling. |
max_global_ttft_ms | int32 | std::numeric_limits<int32_t>::max() | Global TTFT threshold in milliseconds. |
max_global_tpot_ms | int32 | std::numeric_limits<int32_t>::max() | Global TPOT threshold in milliseconds. |
enable_profile_kv_blocks | bool | true | Whether to generate KV Cache blocks for profiling. |
disable_ttft_profiling | bool | false | Whether to disable TTFT profiling. |
enable_forward_interruption | bool | false | Whether to enable forward interruption. |
| Parameter | Type | Default | Description |
|---|
enable_graph | bool | false | Whether to enable graph execution for the decode phase to reduce kernel-launch overhead and device idle time. Supports CUDA Graph, ACL Graph (NPU), and MLU Graph. See Graph Mode. |
enable_graph_mode_decode_no_padding | bool | false | Whether decode graph capture uses the actual num_tokens instead of a padded shape. |
enable_prefill_piecewise_graph | bool | false | Whether to enable piecewise CUDA graph for the prefill phase. Attention runs in eager mode while other operations are captured in CUDA graphs. |
enable_graph_vmm_pool | bool | true | Whether to enable a VMM-backed CUDA graph memory pool for multi-shape graph memory reuse. |
max_tokens_for_graph_mode | int32 | 2048 | Maximum number of tokens for graph execution. 0 means no limit. |
enable_shm | bool | false | Whether to enable shared memory for model execution. |
use_contiguous_input_buffer | bool | true | Whether to use a contiguous device input buffer for model execution. |
input_shm_size | uint64 | 1024 | Input shared-memory size. Default is 1GB. |
output_shm_size | uint64 | 128 | Output shared-memory size. Default is 128MB. |
random_seed | int32 | -1 | Random seed for the random number generator. -1 means no fixed seed. |
| Parameter | Type | Default | Description |
|---|
enable_customize_mla_kernel | bool | false | Whether to enable the customized MLA kernel. |
npu_kernel_backend | string | "AUTO" | NPU kernel backend. Supported values: AUTO, ATB, TORCH. |
enable_intralayer_addnorm | bool | false | Whether to enable fused intralayer addnorm ops. |
| Parameter | Type | Default | Description |
|---|
max_requests_per_batch | int32 | 1 | Maximum number of requests per batch. |
dit_cache_policy | string | "TaylorSeer" | DiT cache policy, for example None, FBCache, TaylorSeer, FBCacheTaylorSeer, or ResidualCache. |
dit_cache_warmup_steps | int64 | 0 | Number of warmup steps. |
dit_cache_n_derivatives | int64 | 3 | Number of derivatives to use in TaylorSeer. |
dit_cache_skip_interval_steps | int64 | 3 | Interval steps to skip for derivative calculation. |
dit_cache_residual_diff_threshold | double | 0.09 | Residual difference threshold for cache reuse. |
dit_cache_start_steps | int64 | 5 | Number of steps to skip at the start. |
dit_cache_end_steps | int64 | 5 | Number of steps to skip at the end. |
dit_cache_start_blocks | int64 | 5 | Number of blocks to skip at the start. |
dit_cache_end_blocks | int64 | 5 | Number of blocks to skip at the end. |
dit_sp_communication_overlap | int64 | 1 | Communication/computation overlap setting for sequence parallelism. |
dit_debug_print | bool | false | Whether to print debug information for DiT models. |
dit_generation_image_area_max | int64 | 0 | Maximum allowed image area (width * height) for image generation requests. 0 means no limit. |
| Parameter | Type | Default | Description |
|---|
enable_rec_fast_sampler | bool | true | Whether to enable the RecSampler fast sampling path for Rec pipelines. |
enable_rec_prefill_only | bool | false | Whether to enable Rec prefill-only mode without decoder self-attention block allocation. |
enable_xattention_one_stage | bool | false | Whether to force xattention one-stage decode for Rec multi-round mode. |
max_decode_rounds | int32 | 0 | Maximum number of decode rounds for multi-step decoding. 0 means disabled. |
enable_constrained_decoding | bool | false | Whether to enable constrained decoding with predefined rules for output format or structure. |
output_rec_logprobs | bool | false | Whether to output Rec multi-round token-aligned logprobs. Missing per-token logprobs are filled with the final beam logprob. |
enable_convert_tokens_to_item | bool | false | Whether to convert token IDs to item IDs in REC/OneRec responses. |
enable_output_sku_logprobs | bool | false | Whether to output REC/OneRec token-aligned logprobs tensors. |
enable_extended_item_info | bool | false | Whether to parse and output REC extended item info tensors. |
each_conversion_threshold | int32 | 50 | Maximum number of items emitted for each REC token triplet. |
total_conversion_threshold | int32 | 1000 | Maximum total number of items emitted in one REC response. |
request_queue_size | int32 | 100000 | Scheduler request queue size. |
rec_worker_max_concurrency | uint32 | 1 | Concurrency for Rec worker parallel execution. Values less than or equal to 1 disable concurrent Rec workers. |