Service Startup Parameters

xLLM uses gflags to manage service startup parameters. --model <PATH> is the only required flag. When --config_json_file is used, values in the JSON file override command-line flag values. The tables below are grouped by the Config classes in /xllm/core/framework/config, with one Config per section. The ConfigJsonUtils section contains the common JSON config-file flags.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| config_json_file | string | "" | Path to a JSON config file. Values in the file override command-line flag values. |
| enable_dump_config_json | bool | false | Whether to dump the resolved startup config as JSON. |
| dump_config_json_file | string | "xllm_config.json" | Path to write the resolved startup config as JSON. Used only when enable_dump_config_json=true. |
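
The precedence rule for config_json_file can be illustrated with a minimal sketch. It assumes, purely for illustration, that the JSON file is a flat object keyed by flag name; the real file layout may be grouped differently, and `resolve_flags` and the `/path/to/model` placeholder are hypothetical names rather than part of xLLM.

```python
import json


def resolve_flags(cli_flags: dict, json_text: str) -> dict:
    """Merge flag values so that entries from the JSON text win over CLI values."""
    resolved = dict(cli_flags)              # start from the command-line values
    resolved.update(json.loads(json_text))  # JSON values override them
    return resolved


# Hypothetical inputs: the flat JSON layout is an assumption of this sketch.
cli = {"model": "/path/to/model", "port": 8010, "max_seqs_per_batch": 1024}
file_text = '{"port": 9000}'

resolved = resolve_flags(cli, file_text)
print(resolved["port"])   # the JSON value replaces the command-line value
print(resolved["model"])  # flags absent from the JSON keep their CLI value
```

Setting enable_dump_config_json=true is the documented way to check what the service actually resolved.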

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| host | string | "" | Host name or IP for the bRPC server. |
| port | int32 | 8010 | Port for the bRPC server. |
| rpc_idle_timeout_s | int32 | -1 | Close the connection when there are no read/write operations during the last rpc_idle_timeout_s seconds. -1 waits indefinitely. |
| rpc_channel_timeout_ms | int32 | -1 | Maximum bRPC Channel duration in milliseconds. -1 waits indefinitely. |
| max_reconnect_count | int32 | 40 | Maximum number of reconnect attempts from a worker to a server. |
| num_threads | int32 | 8 | Number of threads used to process requests. |
| max_concurrent_requests | int32 | 200 | Maximum number of concurrent requests the xLLM instance can handle. Set to 0 for no limit. |
| num_request_handling_threads | int32 | 4 | Number of threads for handling input requests. |
| num_response_handling_threads | int32 | 4 | Number of threads for handling responses. |
| health_check_interval_ms | int32 | 3000 | Worker health-check interval in milliseconds. |

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | string | "" | Hugging Face model name, not a path. |
| model | string | "" | Hugging Face model name or model path. |
| backend | string | "" | Backend model type: llm for text-only models, vlm for multimodal models, or dit for diffusion models. |
| task | string | "generate" | Model task, for example generate, embed, or mm_embed. |
| devices | string | "npu:0" | Devices used by the current process, for example npu:0 or npu:0,npu:1. |
| limit_image_per_prompt | int32 | 4 | Maximum number of images per prompt. Only applies to multimodal models. |
| reasoning_parser | string | "" | Reasoning parser, for example auto, glm45, glm47, glm5, qwen3, qwen35, or deepseek-r1. |
| tool_call_parser | string | "" | Tool-call parser, for example auto, qwen25, qwen3, qwen35, qwen3_coder, kimi_k2, deepseekv3, glm45, glm47, or glm5. |
| enable_qwen3_reranker | bool | false | Whether to enable the Qwen3 reranker. |
| enable_return_mm_full_embeddings | bool | false | Whether VLM models return ViT embeddings and sequence embeddings. |
| flashinfer_workspace_buffer_size | int32 | 134217728 | Reserved FlashInfer workspace buffer for intermediate attention results in split-k attention. Default is 128 MiB. |
| use_audio_in_video | bool | false | Whether to decode both audio and video when the input is a video. |
| use_cpp_chat_template | bool | true | Use native C++ chat templates for supported models, for example deepseek_v32. Set to false to fall back to Jinja for debugging. |
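
As a rough illustration of how these flags reach the process, the sketch below renders a flag dictionary as gflags-style `--name=value` arguments. The binary name `xllm`, the helper `build_argv`, and the model path are placeholders assumed for the example; substitute the actual executable and paths from your installation.

```python
def build_argv(binary: str, flags: dict) -> list:
    """Render flags as gflags-style --name=value arguments."""
    argv = [binary]
    for name, value in flags.items():
        if isinstance(value, bool):
            # gflags accepts explicit true/false for boolean flags
            value = "true" if value else "false"
        argv.append(f"--{name}={value}")
    return argv


# Hypothetical launch configuration for a text-only model on two devices.
argv = build_argv(
    "xllm",  # assumed binary name
    {
        "model": "/path/to/model",  # placeholder path
        "backend": "llm",
        "devices": "npu:0,npu:1",
        "use_cpp_chat_template": False,
    },
)
print(" ".join(argv))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting issues for values such as comma-separated device lists.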

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_manual_loader | bool | false | Pin decoder layer weights to host memory and use async H2D transfer. Required by enable_rolling_load; also implied by enable_xtensor. |
| enable_rolling_load | bool | false | Enable rolling weight load: keep only N decoder layer weight slots in HBM and stream-load each layer just in time. Requires enable_manual_loader=true. NPU only. |
| rolling_load_num_cached_layers | int32 | 2 | Number of decoder layer weight slots to keep in HBM when enable_rolling_load=true. |
| rolling_load_num_rolling_slots | int32 | -1 | Number of rolling slots used by decoder rolling load. Fixed slots are rolling_load_num_cached_layers - rolling_load_num_rolling_slots. -1 means auto, min(2, preload_count). Must be in [-1, rolling_load_num_cached_layers]. |
| enable_prefetch_weight | bool | false | Whether to enable weight prefetching. Only applies to Qwen3-dense models. The default gateup weight prefetch ratio is 40%; adjust with PREFETCH_COEFFOCIENT. |
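
The slot arithmetic described for rolling_load_num_rolling_slots can be written out as a small sketch. The source does not define preload_count, so it is taken here as a plain input, and `resolve_rolling_slots` is a hypothetical helper rather than an xLLM function.

```python
def resolve_rolling_slots(num_cached_layers: int,
                          num_rolling_slots: int,
                          preload_count: int) -> tuple:
    """Return (rolling_slots, fixed_slots) following the documented rules."""
    # Documented valid range: [-1, rolling_load_num_cached_layers]
    if not -1 <= num_rolling_slots <= num_cached_layers:
        raise ValueError("rolling slots must be in [-1, num_cached_layers]")
    if num_rolling_slots == -1:
        # -1 means auto: min(2, preload_count); preload_count is an assumed input
        rolling = min(2, preload_count)
    else:
        rolling = num_rolling_slots
    fixed = num_cached_layers - rolling
    return rolling, fixed


print(resolve_rolling_slots(2, -1, 4))  # defaults with an assumed preload_count of 4
print(resolve_rolling_slots(4, 1, 4))   # one rolling slot out of four cached layers
```

How the auto value interacts with a cached-layer count below 2 is not stated in the source, so the sketch does not clamp it.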

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| block_size | int32 | 128 | Number of slots per KV Cache block. |
| max_cache_size | int64 | 0 | Maximum GPU memory size for KV Cache. 0 means calculated from available memory. |
| max_memory_utilization | double | 0.8 | Fraction of GPU memory used for model inference, including model weights and KV Cache. |
| kv_cache_dtype | string | "auto" | KV Cache dtype for quantization. auto aligns with model dtype and disables quantization. int8 enables INT8 quantization and is only supported on the MLU backend. |
| enable_prefix_cache | bool | true | Whether to enable prefix cache in the block manager. See Prefix Cache. |
| xxh3_128bits_seed | uint32 | 1024 | Default XXH3 128-bit hash seed. |
| enable_xtensor | bool | false | Whether to enable XTensor for model weights with the physical page pool. |
| phy_page_granularity_size | int64 | 2097152 | Granularity size of one physical page in bytes, default 2 MiB, for continuous KV Cache. |
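
To make block_size concrete, the sketch below counts how many KV Cache blocks a sequence would occupy. It assumes one slot holds one token, which the source does not state explicitly; `blocks_needed` is a hypothetical helper. It also checks the byte value of the default page granularity.

```python
BLOCK_SIZE = 128          # default block_size
PHY_PAGE_BYTES = 2097152  # default phy_page_granularity_size


def blocks_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Ceiling division: blocks required to hold num_tokens slots."""
    if num_tokens < 0 or block_size <= 0:
        raise ValueError("num_tokens must be >= 0 and block_size > 0")
    return -(-num_tokens // block_size)


print(blocks_needed(129))                 # one token past a full block needs a second block
print(PHY_PAGE_BYTES == 2 * 1024 * 1024)  # the default granularity is exactly 2 MiB
```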

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| prefetch_timeout | uint32 | 0 | Timeout for prefetching from KV Cache Store. |
| prefetch_batch_size | uint32 | 2 | Copy batch size for prefetching from KV Cache Store. |
| layers_wise_copy_batchs | uint32 | 4 | Number of batches for layer-wise H2D copy. |
| host_blocks_factor | double | 0.0 | Host block factor: host block count = host_blocks_factor * HBM block count. |
| enable_kvcache_store | bool | false | Whether to enable KV Cache Store. |
| enable_cache_upload | bool | false | Whether to upload cache information to the service. Only available when service routing is enabled. |
| store_protocol | string | "tcp" | KV Cache Store protocol, for example tcp or rdma. |
| store_master_server_address | string | "" | Address of the Store master service. |
| store_metadata_server | string | "" | Address of the KV Cache Store metadata service. |
| store_local_hostname | string | "" | Local host name of the KV Cache Store client. |
| enable_control_h2d_block_num | bool | false | Whether to control the number of H2D copy blocks. |
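
The host_blocks_factor relation is simple enough to state as code. In this sketch the result is truncated to an integer, which is an assumption; the source gives only the multiplication. `host_block_count` is a hypothetical helper.

```python
def host_block_count(host_blocks_factor: float, hbm_block_count: int) -> int:
    """host block count = host_blocks_factor * HBM block count (truncation assumed)."""
    if host_blocks_factor < 0 or hbm_block_count < 0:
        raise ValueError("inputs must be non-negative")
    return int(host_blocks_factor * hbm_block_count)


print(host_block_count(0.0, 1000))  # the default factor yields no host blocks
print(host_block_count(2.0, 1000))  # twice as many host blocks as HBM blocks
```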

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_beam_search_kernel | bool | false | Whether to enable the beam search kernel. |
| beam_width | int32 | 1 | Beam width for beam search. |
| enable_block_copy_kernel | bool | true (NPU/CUDA); false (other backends) | Whether to use the block copy kernel on supported backends. |
| enable_topk_sorted | bool | true | Whether to enable sorted top-k output. |

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_tokens_per_batch | int32 | 10240 | Maximum number of tokens per batch. |
| max_seqs_per_batch | int32 | 1024 | Maximum number of sequences per batch. |
| enable_schedule_overlap | bool | false | Whether to enable schedule overlap, also known as asynchronous scheduling. See Async Scheduling. |
| prefill_scheduling_memory_usage_threshold | double | 0.95 | Memory usage threshold during prefill scheduling. |
| enable_chunked_prefill | bool | true | Whether to enable chunked prefill. |
| max_tokens_per_chunk_for_prefill | int32 | -1 | Maximum number of tokens per chunk in the prefill stage. -1 uses the default policy. |
| chunked_match_frequency | int32 | 2 | Sequence prefix-cache match frequency. |
| use_zero_evict | bool | false | Whether to use ZeroEvictionScheduler. See Zero Evict Scheduler. |
| max_decode_token_per_sequence | int32 | 256 | Maximum decode tokens per sequence for ZeroEvictionScheduler. |
| priority_strategy | string | "fcfs" | Request priority strategy, for example fcfs, priority, or deadline. |
| use_mix_scheduler | bool | false | Whether to use MixScheduler to handle prefill and decode uniformly. |
| enable_online_preempt_offline | bool | true | Whether online requests can preempt offline requests. |
| aggressive_coeff | double | 1.0 | Aggressive coefficient for MixScheduler urgency judgment. |
| starve_threshold | double | 1.0 | Starvation threshold coefficient for MixScheduler. |
| enable_starve_prevent | bool | true | Whether to enable anti-starvation behavior in MixScheduler. |
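
The effect of an explicit max_tokens_per_chunk_for_prefill can be pictured with a generic chunking sketch. This is not the xLLM scheduler: it ignores batching limits such as max_tokens_per_batch, and it deliberately does not model the -1 default policy, which the source does not describe. `split_prefill` is a hypothetical helper.

```python
def split_prefill(prompt_len: int, max_tokens_per_chunk: int) -> list:
    """Split a prompt of prompt_len tokens into chunk sizes of at most max_tokens_per_chunk."""
    if max_tokens_per_chunk <= 0:
        # The -1 default policy is not specified in the source, so it is not modeled here.
        raise ValueError("this sketch covers only explicit positive chunk sizes")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        step = min(remaining, max_tokens_per_chunk)
        chunks.append(step)
        remaining -= step
    return chunks


print(split_prefill(5000, 2048))  # two full chunks followed by the remainder
```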

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dp_size | int32 | 1 | Data parallel size for MLA attention. |
| ep_size | int32 | 1 | Expert parallel size for MoE models. |
| cp_size | int32 | 1 | Context parallel size for DSA attention. |
| kv_split_size | int32 | 0 | KV Cache split width. 0 falls back to cp_size; 1 disables KV split; other K values that divide cp_size shard KV across K ranks. |
| prefill_kv_split_size | int32 | 0 | KV Cache split width of the remote prefill instance. Decode nodes use it in PD mode to match the prefill logical block layout. 0 falls back to local cp_size. |
| tp_size | int64 | 1 | Tensor parallelism size. Only used for DiT models. |
| sp_size | int64 | 1 | Sequence parallelism size. Only used for DiT models. |
| cfg_size | int64 | 1 | Classifier-free guidance parallelism size. Only used for DiT models. |
| communication_backend | string | "hccl" | NPU communication backend, for example lccl or hccl. Uses hccl when DP is enabled. |
| enable_prefill_sp | bool | false | Whether to enable prefill-only sequence parallel. Supports enable_chunked_prefill=true, but only for prefill-only batches (PREFILL / CHUNKED_PREFILL). |
| enable_multi_stream_parallel | bool | false | Whether to enable computation/communication parallelism with two streams and two micro batches in the prefill stage. See Multi-Stream Parallelism. |
| micro_batch_num | int32 | 1 | Number of micro batches used for multi-stream parallelism. |
| enable_dp_balance | bool | false | Whether to enable DP load balancing. If true, sequences within a single DP batch are shuffled. |
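
The kv_split_size rules (0 falls back to cp_size, 1 disables the split, any other K must divide cp_size) can be captured in a short validation sketch. `resolve_kv_split` is a hypothetical helper, and rejecting negative values is an assumption since the source does not mention them.

```python
def resolve_kv_split(kv_split_size: int, cp_size: int) -> int:
    """Return the effective KV split width under the documented rules."""
    if kv_split_size == 0:
        return cp_size  # 0 falls back to cp_size
    if kv_split_size < 0 or cp_size % kv_split_size != 0:
        raise ValueError("kv_split_size must be 0 or a positive divisor of cp_size")
    return kv_split_size  # 1 means no split; K > 1 shards KV across K ranks


print(resolve_kv_split(0, 4))  # falls back to cp_size
print(resolve_kv_split(1, 4))  # split disabled
print(resolve_kv_split(2, 4))  # KV sharded across two ranks
```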

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_eplb | bool | false | Whether to enable expert parallel load balance. See EPLB. |
| redundant_experts_num | int32 | 1 | Number of redundant experts per device. |
| eplb_update_interval | int64 | 1000 | EPLB update interval. |
| eplb_update_threshold | double | 0.8 | EPLB update threshold. |
| expert_parallel_degree | int32 | 0 | Expert parallel degree. |
| rank_tablefile | string | "" | ATB HCCL rank table file. |

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| master_node_addr | string | "127.0.0.1:19888" | Master address for multi-node distributed serving, for example 10.18.1.1:9999. |
| xtensor_master_node_addr | string | "127.0.0.1:19889" | Master address for the XTensor distributed service, for example 10.18.1.1:9999. |
| nnodes | int32 | 1 | Number of nodes in a multi-node deployment. |
| node_rank | int32 | 0 | Rank of the current node. |
| device_ip | string | "" | Device IP address for KV Cache transfer. |
| etcd_addr | string | "" | etcd address used to save instance metadata. |
| etcd_namespace | string | "" | Optional etcd namespace prefix for all xLLM keys, for example prod-a. |
| enable_service_routing | bool | false | Whether to enable xLLM service routing. |
| heart_beat_interval | double | 0.5 | Heartbeat interval. |
| etcd_ttl | int32 | 3 | Time to live for etcd keys. |
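
For multi-node serving, each process needs its own node_rank while sharing the other settings. The sketch below generates one flag set per node under the common convention, assumed here rather than confirmed by the source, that every node passes the same master_node_addr and nnodes and that ranks run from 0 to nnodes - 1. `node_flags` is a hypothetical helper.

```python
def node_flags(master_node_addr: str, nnodes: int) -> list:
    """Build one flag dictionary per node, differing only in node_rank."""
    if nnodes < 1:
        raise ValueError("nnodes must be at least 1")
    return [
        {"master_node_addr": master_node_addr, "nnodes": nnodes, "node_rank": rank}
        for rank in range(nnodes)
    ]


# Address taken from the example in the table above.
for flags in node_flags("10.18.1.1:9999", 2):
    print(flags)
```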

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_disagg_pd | bool | false | Whether to enable disaggregated prefill and decode execution. See P-D Separation. |
| enable_pd_ooc | bool | false | Whether to enable online-offline co-location in disaggregated PD mode. |
| disagg_pd_port | int32 | 7777 | Listening port for the disaggregated PD bRPC server. |
| instance_role | string | "DEFAULT" | Instance role, for example DEFAULT, PREFILL, DECODE, or MIX. |
| kv_cache_transfer_type | string | "LlmDataDist" | KV Cache transfer type, for example LlmDataDist, Mooncake, or HCCL. |
| kv_cache_transfer_mode | string | "PUSH" | KV Cache transfer mode, for example PUSH or PULL. |
| transfer_listen_port | int32 | 26000 | Listening port for KV Cache Transfer. |
| kv_push_dst_rotate | bool | true | Rotate the destination-worker traversal order in push_kv_blocks per KV-split rank to spread traffic across decode workers. |
| kv_push_timing_log | bool | false | Whether to emit per-step and per-call timing logs for push_kv_blocks. |

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| draft_model | string | "" | Draft model path. See MTP for MTP usage. |
| draft_devices | string | "npu:0" | Devices used by the draft model, for example npu:0 or npu:0,npu:1. |
| num_speculative_tokens | int32 | 0 | Number of speculative tokens generated per speculative decoding step. |
| speculative_algorithm | string | "MTP" | Speculative decoding algorithm. Supported values: MTP, Eagle3, Suffix. |
| speculative_suffix_cache_max_depth | int32 | 64 | Maximum suffix-tree depth for suffix speculative decoding. |
| speculative_suffix_max_spec_factor | double | 1.0 | Maximum suffix speculation token factor relative to match length. |
| speculative_suffix_max_spec_offset | double | 0.0 | Maximum additive token offset for suffix speculation. |
| speculative_suffix_min_token_prob | double | 0.1 | Minimum token probability used in suffix speculation. |
| speculative_suffix_max_cached_requests | int32 | -1 | Maximum number of globally cached requests for suffix speculation. -1 means unlimited; 0 disables it. |
| speculative_suffix_use_tree_spec | bool | false | Whether to use tree-based suffix speculation instead of path speculation. |
| enable_opt_validate_probs | bool | false | Whether validation uses selected-only draft_probs [B,S] directly. If false, selected-only cache values are restored to dense [B,S,V]. |
| enable_atb_spec_kernel | bool | false | Whether to use the ATB speculative kernel. |
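
One plausible reading of the suffix factor and offset flags is that they bound the number of speculated tokens as a linear function of the matched suffix length. The sketch below shows that reading only; the exact formula, the rounding, and any additional cap are assumptions not confirmed by the source, and `suffix_spec_budget` is a hypothetical helper.

```python
def suffix_spec_budget(match_len: int,
                       max_spec_factor: float = 1.0,
                       max_spec_offset: float = 0.0) -> int:
    """Assumed bound: factor * match_len + offset, truncated and floored at zero."""
    budget = max_spec_factor * match_len + max_spec_offset
    return max(0, int(budget))


print(suffix_spec_budget(8))            # defaults: budget equals the match length
print(suffix_spec_budget(8, 0.5, 2.0))  # smaller factor plus an additive offset
```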

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_profile_step_time | bool | false | Whether to enable step-time profiling. |
| enable_profile_token_budget | bool | false | Whether to enable token-budget profiling. |
| enable_latency_aware_schedule | bool | false | Whether to use predicted latency for latency-aware scheduling. |
| profile_max_prompt_length | int32 | 2048 | Maximum prompt length used for profiling. |
| max_global_ttft_ms | int32 | std::numeric_limits<int32_t>::max() | Global TTFT threshold in milliseconds. |
| max_global_tpot_ms | int32 | std::numeric_limits<int32_t>::max() | Global TPOT threshold in milliseconds. |
| enable_profile_kv_blocks | bool | true | Whether to generate KV Cache blocks for profiling. |
| disable_ttft_profiling | bool | false | Whether to disable TTFT profiling. |
| enable_forward_interruption | bool | false | Whether to enable forward interruption. |

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_graph | bool | false | Whether to enable graph execution for the decode phase to reduce kernel-launch overhead and device idle time. Supports CUDA Graph, ACL Graph (NPU), and MLU Graph. See Graph Mode. |
| enable_graph_mode_decode_no_padding | bool | false | Whether decode graph capture uses the actual num_tokens instead of a padded shape. |
| enable_prefill_piecewise_graph | bool | false | Whether to enable piecewise CUDA graph for the prefill phase. Attention runs in eager mode while other operations are captured in CUDA graphs. |
| enable_graph_vmm_pool | bool | true | Whether to enable a VMM-backed CUDA graph memory pool for multi-shape graph memory reuse. |
| max_tokens_for_graph_mode | int32 | 2048 | Maximum number of tokens for graph execution. 0 means no limit. |
| enable_shm | bool | false | Whether to enable shared memory for model execution. |
| use_contiguous_input_buffer | bool | true | Whether to use a contiguous device input buffer for model execution. |
| input_shm_size | uint64 | 1024 | Input shared-memory size in MB. Default is 1024 MB (1 GB). |
| output_shm_size | uint64 | 128 | Output shared-memory size in MB. Default is 128 MB. |
| random_seed | int32 | -1 | Random seed for the random number generator. -1 means no fixed seed. |

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_customize_mla_kernel | bool | false | Whether to enable the customized MLA kernel. |
| npu_kernel_backend | string | "AUTO" | NPU kernel backend. Supported values: AUTO, ATB, TORCH. |
| enable_intralayer_addnorm | bool | false | Whether to enable fused intralayer addnorm ops. |

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_requests_per_batch | int32 | 1 | Maximum number of requests per batch. |
| dit_cache_policy | string | "TaylorSeer" | DiT cache policy, for example None, FBCache, TaylorSeer, FBCacheTaylorSeer, or ResidualCache. |
| dit_cache_warmup_steps | int64 | 0 | Number of warmup steps. |
| dit_cache_n_derivatives | int64 | 3 | Number of derivatives to use in TaylorSeer. |
| dit_cache_skip_interval_steps | int64 | 3 | Interval steps to skip for derivative calculation. |
| dit_cache_residual_diff_threshold | double | 0.09 | Residual difference threshold for cache reuse. |
| dit_cache_start_steps | int64 | 5 | Number of steps to skip at the start. |
| dit_cache_end_steps | int64 | 5 | Number of steps to skip at the end. |
| dit_cache_start_blocks | int64 | 5 | Number of blocks to skip at the start. |
| dit_cache_end_blocks | int64 | 5 | Number of blocks to skip at the end. |
| dit_sp_communication_overlap | int64 | 1 | Communication/computation overlap setting for sequence parallelism. |
| dit_debug_print | bool | false | Whether to print debug information for DiT models. |
| dit_generation_image_area_max | int64 | 0 | Maximum allowed image area (width * height) for image generation requests. 0 means no limit. |
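
The dit_generation_image_area_max rule amounts to a width * height check with 0 as the "no limit" sentinel. The sketch below treats the maximum as inclusive, which is an assumption; `area_allowed` is a hypothetical helper and not the check xLLM performs internally.

```python
def area_allowed(width: int, height: int, area_max: int = 0) -> bool:
    """Return True if width * height fits within area_max (0 means no limit)."""
    if area_max == 0:
        return True
    return width * height <= area_max  # inclusive bound assumed


limit = 1024 * 1024
print(area_allowed(1024, 1024, limit))  # exactly at the limit
print(area_allowed(1025, 1024, limit))  # one extra column exceeds it
print(area_allowed(4096, 4096))         # default 0 disables the check
```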

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_rec_fast_sampler | bool | true | Whether to enable the RecSampler fast sampling path for Rec pipelines. |
| enable_rec_prefill_only | bool | false | Whether to enable Rec prefill-only mode without decoder self-attention block allocation. |
| enable_xattention_one_stage | bool | false | Whether to force xattention one-stage decode for Rec multi-round mode. |
| max_decode_rounds | int32 | 0 | Maximum number of decode rounds for multi-step decoding. 0 means disabled. |
| enable_constrained_decoding | bool | false | Whether to enable constrained decoding with predefined rules for output format or structure. |
| output_rec_logprobs | bool | false | Whether to output Rec multi-round token-aligned logprobs. Missing per-token logprobs are filled with the final beam logprob. |
| enable_convert_tokens_to_item | bool | false | Whether to convert token IDs to item IDs in REC/OneRec responses. |
| enable_output_sku_logprobs | bool | false | Whether to output REC/OneRec token-aligned logprobs tensors. |
| enable_extended_item_info | bool | false | Whether to parse and output REC extended item info tensors. |
| each_conversion_threshold | int32 | 50 | Maximum number of items emitted for each REC token triplet. |
| total_conversion_threshold | int32 | 1000 | Maximum total number of items emitted in one REC response. |
| request_queue_size | int32 | 100000 | Scheduler request queue size. |
| rec_worker_max_concurrency | uint32 | 1 | Concurrency for Rec worker parallel execution. Values less than or equal to 1 disable concurrent Rec workers. |