APIServer Parameter Details#
This document provides detailed information about all startup parameters and their usage for LightLLM APIServer.
Basic Configuration Parameters#
- --run_mode#
Set the running mode, optional values:
normal: Single server mode (default)
prefill: Prefill mode (for PD disaggregation running mode)
decode: Decode mode (for PD disaggregation running mode)
pd_master: PD master node mode (for PD disaggregation running mode)
config_server: Configuration server mode (for PD disaggregation mode). It is used to register pd_master nodes and to get the pd_master node list. It is designed for large-scale, high-concurrency scenarios and is used when pd_master encounters significant CPU bottlenecks.
- --host#
Server listening address, default is
127.0.0.1
- --port#
Server listening port, default is
8000
- --httpserver_workers#
HTTP server worker process count, default is
1
- --zmq_mode#
ZMQ communication mode, optional values:
tcp://: TCP mode
ipc:///tmp/: IPC mode (default)
Can only choose from
['tcp://', 'ipc:///tmp/']
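The basic parameters above can be combined into a launch command. The following is a minimal sketch, not an official launcher: it only builds and prints the argument list. The entry point `lightllm.server.api_server` and the helper name `build_launch_argv` are assumptions for illustration; `--model_dir` is described in the Model Configuration Parameters section.

```python
import shlex

# Assumed entry point; adjust if your installation starts the server differently.
ENTRY = ["python", "-m", "lightllm.server.api_server"]

ALLOWED_ZMQ_MODES = ("tcp://", "ipc:///tmp/")


def build_launch_argv(model_dir, host="127.0.0.1", port=8000,
                      httpserver_workers=1, zmq_mode="ipc:///tmp/"):
    """Build an argument list from the basic parameters (hypothetical helper)."""
    if zmq_mode not in ALLOWED_ZMQ_MODES:
        raise ValueError(f"zmq_mode must be one of {ALLOWED_ZMQ_MODES}")
    return ENTRY + [
        "--model_dir", model_dir,
        "--host", host,
        "--port", str(port),
        "--httpserver_workers", str(httpserver_workers),
        "--zmq_mode", zmq_mode,
    ]


argv = build_launch_argv("/path/to/model")
# Print a shell-ready command line instead of starting a process.
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` would start the server; printing it first makes it easy to review the flags.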
PD disaggregation Mode Parameters#
- --pd_master_ip#
PD master node IP address, default is
0.0.0.0
This parameter needs to be set when run_mode is set to prefill or decode.
- --pd_master_port#
PD master node port, default is
1212
This parameter needs to be set when run_mode is set to prefill or decode.
- --pd_decode_rpyc_port#
Port used by decode nodes for kv move manager rpyc server in PD mode, default is
42000
- --config_server_host#
Host address in configuration server mode
- --config_server_port#
Port number in configuration server mode
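As an illustration of how the PD disaggregation parameters fit together, the sketch below builds one argument list per role. It is a hedged example under stated assumptions: the entry point `lightllm.server.api_server`, the helper `pd_node_argv`, and all addresses and ports in the usage lines are hypothetical. It encodes only the rule stated above, that prefill and decode nodes must be given the PD master address.

```python
import shlex

# Assumed entry point; not confirmed by this document.
ENTRY = ["python", "-m", "lightllm.server.api_server"]


def pd_node_argv(run_mode, model_dir, host, port,
                 pd_master_ip=None, pd_master_port=None):
    """Build the argument list for one node in a PD deployment (hypothetical helper)."""
    if run_mode not in ("pd_master", "prefill", "decode"):
        raise ValueError(f"unsupported run_mode for this sketch: {run_mode}")
    argv = ENTRY + [
        "--run_mode", run_mode,
        "--model_dir", model_dir,
        "--host", host,
        "--port", str(port),
    ]
    if run_mode in ("prefill", "decode"):
        # prefill and decode nodes need to know where the PD master listens.
        if pd_master_ip is None or pd_master_port is None:
            raise ValueError("pd_master_ip and pd_master_port are required")
        argv += ["--pd_master_ip", pd_master_ip,
                 "--pd_master_port", str(pd_master_port)]
    return argv


# Example addresses and ports are placeholders.
master = pd_node_argv("pd_master", "/path/to/model", "10.0.0.1", 60011)
prefill = pd_node_argv("prefill", "/path/to/model", "10.0.0.2", 8017,
                       pd_master_ip="10.0.0.1", pd_master_port=60011)
decode = pd_node_argv("decode", "/path/to/model", "10.0.0.3", 8118,
                      pd_master_ip="10.0.0.1", pd_master_port=60011)
for node in (master, prefill, decode):
    print(shlex.join(node))
```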
Model Configuration Parameters#
- --model_name#
Model name, used to distinguish internal model names, default is
default_model_name
Can be obtained via
host:port/get_model_name
- --model_dir#
Model weight directory path, the application will load configuration, weights, and tokenizer from this directory
- --tokenizer_mode#
Tokenizer loading mode, optional values:
slow: Slow mode, loads fast but runs slow, suitable for debugging and testing
fast: Fast mode (default), achieves best performance
auto: Auto mode, tries to use fast mode, falls back to slow mode if it fails
- --load_way#
Model weight loading method, default is
HF (Hugging Face format)
Llama models also support the DS (DeepSpeed) format.
- --trust_remote_code#
Whether to allow using custom model definition files on Hub
Memory and Batch Processing Parameters#
- --max_total_token_num#
Total token number of kv cache.
If not specified, will be automatically calculated based on mem_fraction
- --mem_fraction#
Memory usage ratio, default is
0.9
If OOM occurs during runtime, you can specify a smaller value.
- --batch_max_tokens#
Maximum token count for new batches, controls prefill batch size to prevent OOM
- --running_max_req_size#
Maximum number of requests for simultaneous forward inference, default is
1000
- --max_req_total_len#
Maximum value of request input length + request output length, default is
16384
- --eos_id#
End-of-sequence (stop) token ID; multiple values can be specified. If None, it is loaded from config.json
- --tool_call_parser#
OpenAI interface tool call parser type, optional values:
qwen25
llama3
mistral
Different Parallel Mode Setting Parameters#
- --nnodes#
Number of nodes, default is
1
- --node_rank#
Current node rank, default is
0
- --multinode_httpmanager_port#
Multi-node HTTP manager port, default is
12345
- --multinode_router_gloo_port#
Multi-node router gloo port, default is
20001
- --tp#
Model tensor parallelism size, default is
1
- --dp#
Data parallelism size, default is
1
This is a useful parameter for deepseekv2. When using a deepseekv2 model, set dp equal to the tp parameter. In other cases, do not set it; keep the default value of 1.
- --nccl_host#
nccl_host used to build PyTorch distributed environment, default is
127.0.0.1
For multi-node deployment, this should be set to the master node's IP.
- --nccl_port#
nccl_port used to build PyTorch distributed environment, default is
28765
- --use_config_server_to_init_nccl#
Use tcp store server started by config_server to initialize nccl, default is False
When set to True, --nccl_host must equal config_server_host, and --nccl_port must be unique for config_server. Do not use the same nccl_port for different inference nodes; doing so is a serious error.
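The multi-node parameters above can be sketched as one argument list per node. This is an illustrative example only: the entry point `lightllm.server.api_server`, the helper `multinode_argvs`, and the example IP and tp value are assumptions. It applies the one rule stated above, that every node sets nccl_host to the master node's IP, and assumes all nodes of one deployment share the same nccl_port.

```python
import shlex

# Assumed entry point; not confirmed by this document.
ENTRY = ["python", "-m", "lightllm.server.api_server"]


def multinode_argvs(model_dir, nnodes, tp, master_ip, nccl_port=28765):
    """Return one argument list per node rank (hypothetical helper)."""
    if nnodes < 1:
        raise ValueError("nnodes must be at least 1")
    argvs = []
    for node_rank in range(nnodes):
        argvs.append(ENTRY + [
            "--model_dir", model_dir,
            "--nnodes", str(nnodes),
            "--node_rank", str(node_rank),
            "--tp", str(tp),
            # Every node points at the master node for the distributed setup.
            "--nccl_host", master_ip,
            "--nccl_port", str(nccl_port),
        ])
    return argvs


# Placeholder values: two nodes, master at 10.0.0.1.
commands = multinode_argvs("/path/to/model", nnodes=2, tp=16, master_ip="10.0.0.1")
for cmd in commands:
    print(shlex.join(cmd))
```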
Scheduling Parameters#
- --router_token_ratio#
Threshold for determining if the service is busy, default is
0.0. Once the kv cache usage exceeds this value, it will directly switch to conservative scheduling.
- --router_max_new_token_len#
The request output length used by the scheduler when evaluating request kv usage, default is
1024, generally lower than the max_new_tokens set by the user. This parameter only takes effect when --router_token_ratio is greater than 0. Setting this parameter makes request scheduling more aggressive, allowing the system to process more requests simultaneously, but it will inevitably cause some requests to be paused and recomputed.
- --router_max_wait_tokens#
Trigger scheduling of new requests every router_max_wait_tokens decoding steps, default is
6
- --disable_aggressive_schedule#
Disable aggressive scheduling
Aggressive scheduling may cause frequent prefill interruptions during decoding. Disabling it can make the router_max_wait_tokens parameter work more effectively.
- --disable_dynamic_prompt_cache#
Disable kv cache caching
- --chunked_prefill_size#
Chunked prefill size, default is
4096
- --disable_chunked_prefill#
Whether to disable chunked prefill
- --diverse_mode#
Multi-result output mode
- --schedule_time_interval#
Schedule time interval, default is
0.03, unit is seconds
Output Constraint Parameters#
- --token_healing_mode#
- --output_constraint_mode#
Set the output constraint backend, optional values:
outlines: Use outlines backend
xgrammar: Use xgrammar backend
none: No output constraint (default)
- --first_token_constraint_mode#
Constrain the allowed range of the first token.
Use the environment variable FIRST_ALLOWED_TOKENS to set the range, e.g., FIRST_ALLOWED_TOKENS=1,2
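The FIRST_ALLOWED_TOKENS format shown above (a comma-separated list of token IDs) can be parsed as in the following sketch. The function name `parse_first_allowed_tokens` is hypothetical, and this is not necessarily how LightLLM reads the variable internally; it only illustrates the expected value format.

```python
import os


def parse_first_allowed_tokens(env=os.environ):
    """Parse a comma-separated list of token IDs (hypothetical helper)."""
    raw = env.get("FIRST_ALLOWED_TOKENS", "")
    # Skip empty pieces so that "" and trailing commas yield no extra entries.
    return [int(piece) for piece in raw.split(",") if piece.strip()]


allowed = parse_first_allowed_tokens({"FIRST_ALLOWED_TOKENS": "1,2"})
print(allowed)  # [1, 2]
```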
Multimodal Parameters#
- --enable_multimodal#
Whether to allow loading additional visual models
- --enable_multimodal_audio#
Whether to allow loading additional audio models (requires --enable_multimodal)
- --enable_mps#
Whether to enable nvidia mps for multimodal services
- --cache_capacity#
Cache server capacity for multimodal resources, default is
200
- --visual_infer_batch_size#
Number of images processed in each inference batch, default is
1
- --visual_gpu_ids#
List of GPU IDs to use, e.g., 0 1 2
- --visual_tp#
Number of tensor parallel instances for ViT, default is
1
- --visual_dp#
Number of data parallel instances for ViT, default is
1
- --visual_nccl_ports#
List of NCCL ports for ViT, e.g., 29500 29501 29502, default is [29500]
- --vit_att_backend#
Set the attention backend for ViT. Available options:
auto: Automatically select the best backend (default), with priority fa3 > xformers > sdpa > triton
fa3: Use Flash-Attention 3 backend
xformers: Use xformers backend
sdpa: Use sdpa backend
triton: Use Triton backend
Performance Optimization Parameters#
- --disable_custom_allreduce#
Whether to disable custom allreduce
- --enable_custom_allgather#
Whether to enable custom allgather
- --enable_tpsp_mix_mode#
The inference backend will use TP SP mixed running mode
Currently only supports llama and deepseek series models
- --enable_prefill_microbatch_overlap#
The inference backend will use microbatch overlap mode for prefill
Currently only supports deepseek series.
- --enable_decode_microbatch_overlap#
The inference backend will use microbatch overlap mode for decoding
- --llm_prefill_att_backend#
Set the attention backend for the prefill phase. Available options:
auto: Automatically select the best backend (default), with priority fa3 > flashinfer > triton
fa3: Use Flash-Attention 3 backend
flashinfer: Use FlashInfer backend
triton: Use Triton backend
- --llm_decode_att_backend#
Set the attention backend for the decode phase. Available options:
auto: Automatically select the best backend (default), with priority fa3 > flashinfer > triton
fa3: Use Flash-Attention 3 backend
flashinfer: Use FlashInfer backend
triton: Use Triton backend
- --disable_cudagraph#
Disable cudagraph in the decoding phase
- --graph_max_batch_size#
Maximum batch size that can be captured by cuda graph in the decoding phase, default is
256
- --graph_split_batch_size#
Controls the interval for generating CUDA graphs during decoding, default is
32
For batch sizes from 1 to graph_split_batch_size, CUDA graphs are generated continuously. For batch sizes from graph_split_batch_size to graph_max_batch_size, a new CUDA graph is generated for every increase of graph_grow_step_size. Configuring this parameter properly can help optimize CUDA graph execution performance.
- --graph_grow_step_size#
For batch_size values from graph_split_batch_size to graph_max_batch_size, a new CUDA graph will be generated for every increase of graph_grow_step_size, default is
16
- --graph_max_len_in_batch#
Maximum sequence length that can be captured by cuda graph in the decoding phase, default is
max_req_total_len
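To make the interaction of graph_max_batch_size, graph_split_batch_size, and graph_grow_step_size concrete, the sketch below lists the batch sizes that would get a CUDA graph under one reading of the description above. It assumes that "generated continuously" means one graph per batch size up to the split point, and that graph_max_batch_size itself is always captured; neither detail is confirmed here, and the function name is hypothetical.

```python
def cuda_graph_batch_sizes(graph_max_batch_size=256,
                           graph_split_batch_size=32,
                           graph_grow_step_size=16):
    """List the batch sizes captured as CUDA graphs (illustrative sketch)."""
    if min(graph_max_batch_size, graph_split_batch_size, graph_grow_step_size) < 1:
        raise ValueError("all sizes must be positive")
    split = min(graph_split_batch_size, graph_max_batch_size)
    # Dense region: every batch size from 1 up to the split point.
    sizes = list(range(1, split + 1))
    # Sparse region: one graph per graph_grow_step_size increase.
    sizes += list(range(split + graph_grow_step_size,
                        graph_max_batch_size + 1,
                        graph_grow_step_size))
    # Assumption: the maximum batch size is captured even if the step skips it.
    if sizes[-1] != graph_max_batch_size:
        sizes.append(graph_max_batch_size)
    return sizes


sizes = cuda_graph_batch_sizes()
print(len(sizes), sizes[30:36])  # 46 [31, 32, 48, 64, 80, 96]
```

With the defaults this gives 32 graphs in the dense region and 14 in the sparse region. A larger graph_grow_step_size captures fewer graphs, at the cost of a coarser match between the actual batch size and the captured one.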
Quantization Parameters#
- --quant_type#
Quantization method, optional values:
vllm-w8a8
vllm-fp8w8a8
vllm-fp8w8a8-b128
deepgemm-fp8w8a8-b128
triton-fp8w8a8-block128
awq
awq_marlin
none (default)
- --quant_cfg#
Path to quantization configuration file. Can be used for mixed quantization.
Examples can be found in test/advanced_config/mixed_quantization/llamacls-mix-down.yaml.
- --vit_quant_type#
ViT quantization method, optional values:
vllm-w8a8
vllm-fp8w8a8
none (default)
- --vit_quant_cfg#
Path to ViT quantization configuration file. Can be used for mixed quantization.
Examples can be found in lightllm/common/quantization/configs.
Sampling and Generation Parameters#
- --sampling_backend#
Implementation used for sampling, optional values:
triton: Use torch and triton kernel (default)
sglang_kernel: Use sglang_kernel implementation
- --return_all_prompt_logprobs#
Return logprobs for all prompt tokens
- --use_reward_model#
Use reward model
- --long_truncation_mode#
How to handle when input_token_len + max_new_tokens > max_req_total_len, optional values:
None: Throw exception (default)
head: Remove some head tokens to make input_token_len + max_new_tokens <= max_req_total_len
center: Remove some tokens at the center position to make input_token_len + max_new_tokens <= max_req_total_len
- --use_tgi_api#
Use tgi input and output format
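The long_truncation_mode options can be illustrated with a small sketch. This is not LightLLM's implementation: the function name `truncate_prompt` is hypothetical, and the even front/back split used for center mode is an assumption. It only shows how each mode brings input_token_len + max_new_tokens within max_req_total_len.

```python
def truncate_prompt(token_ids, max_new_tokens, max_req_total_len, mode=None):
    """Truncate a prompt according to long_truncation_mode (illustrative sketch)."""
    budget = max_req_total_len - max_new_tokens  # tokens left for the input
    if len(token_ids) <= budget:
        return list(token_ids)
    if mode is None:
        raise ValueError("input_token_len + max_new_tokens > max_req_total_len")
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for input tokens")
    if mode == "head":
        # Drop tokens from the head, keeping the last `budget` tokens.
        return list(token_ids[-budget:])
    if mode == "center":
        # Drop tokens from the middle, keeping both ends of the prompt.
        front = budget // 2
        back = budget - front
        return list(token_ids[:front]) + list(token_ids[-back:])
    raise ValueError(f"unknown long_truncation_mode: {mode!r}")


tokens = list(range(10))
print(truncate_prompt(tokens, 4, 10, mode="head"))    # [4, 5, 6, 7, 8, 9]
print(truncate_prompt(tokens, 4, 10, mode="center"))  # [0, 1, 2, 7, 8, 9]
```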
MTP Multi-Prediction Parameters#
- --mtp_mode#
Supported MTP modes. eagle_with_att is recommended for better performance. Optional values:
vanilla_with_att
eagle_with_att
vanilla_no_att
eagle_no_att
None: Do not enable mtp (default)
- --mtp_draft_model_dir#
Path to the draft model for MTP multi-prediction functionality
Used to load the MTP multi-output token model.
- --mtp_step#
Specify the number of additional tokens predicted by the draft model, default is
0
Currently this feature only supports DeepSeekV3/R1 models. Increasing this value allows more predictions, but ensure the model is compatible with the specified number of steps. Currently DeepSeekV3/R1 models only support 1 step.
DeepSeek Redundant Expert Parameters#
- --ep_redundancy_expert_config_path#
Path to redundant expert configuration. Can be used for deepseekv3 models.
- --auto_update_redundancy_expert#
Whether to update redundant experts for deepseekv3 models through online expert usage counters.
Monitoring and Logging Parameters#
- --disable_log_stats#
Disable throughput statistics logging
- --log_stats_interval#
Interval for recording statistics (seconds), default is
10
- --health_monitor#
Check service health status and restart on error
- --metric_gateway#
Address for collecting monitoring metrics
- --job_name#
Job name for monitoring, default is
lightllm
- --grouping_key#
Grouping key for monitoring, format is key=value, can specify multiple
- --push_interval#
Interval for pushing monitoring metrics (seconds), default is
10
- --enable_monitor_auth#
Whether to enable authentication for push_gateway
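The key=value format used by --grouping_key can be parsed as in the sketch below. The helper name `parse_grouping_keys` and the example keys are hypothetical, and LightLLM's own parsing may differ; the example only illustrates how several key=value entries combine into one mapping.

```python
def parse_grouping_keys(items):
    """Turn ["key=value", ...] into a dict (hypothetical helper)."""
    grouping = {}
    for item in items:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        grouping[key] = value
    return grouping


keys = parse_grouping_keys(["instance=node0", "cluster=a"])
print(keys)  # {'instance': 'node0', 'cluster': 'a'}
```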