APIServer Parameter Details#
This document provides detailed information about all startup parameters and their usage for LightLLM APIServer.
Basic Configuration Parameters#
- --run_mode#
Set the running mode, optional values:
normal: Single server mode (default)
prefill: Prefill mode (for PD disaggregation running mode)
decode: Decode mode (for PD disaggregation running mode)
pd_master: PD master node mode (for PD disaggregation running mode)
config_server: Configuration server mode (for PD disaggregation mode), used to register pd_master nodes and get the pd_master node list. It is designed for large-scale, high-concurrency scenarios and is used when pd_master encounters significant CPU bottlenecks.
- --host#
Server listening address, default is
127.0.0.1
- --port#
Server listening port, default is
8000
- --httpserver_workers#
HTTP server worker process count, default is
1
- --zmq_mode#
ZMQ communication mode, optional values:
tcp://: TCP mode
ipc:///tmp/: IPC mode (default)
Can only choose from
['tcp://', 'ipc:///tmp/']
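The basic parameters above can be assembled into a launch command. The following is a minimal sketch, not an official launcher: the helper `build_launch_command` is hypothetical, the module path `lightllm.server.api_server` is an assumption about how the server is started, and `/path/to/model` is a placeholder.

```python
import shlex


def build_launch_command(model_dir, **options):
    """Build an argv list for starting the API server (hypothetical helper)."""
    # Assumption: the server is started as a Python module with this path.
    argv = ["python", "-m", "lightllm.server.api_server", "--model_dir", model_dir]
    for name, value in options.items():
        # Each keyword becomes a "--name value" pair.
        argv.append(f"--{name}")
        argv.append(str(value))
    return argv


cmd = build_launch_command(
    "/path/to/model",  # placeholder path
    run_mode="normal",
    host="127.0.0.1",
    port=8000,
    httpserver_workers=1,
    zmq_mode="ipc:///tmp/",
)
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(cmd))
```

Building the command as a list keeps values such as `ipc:///tmp/` intact without manual quoting.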
PD disaggregation Mode Parameters#
- --pd_master_ip#
PD master node IP address, default is
0.0.0.0
This parameter needs to be set when run_mode is set to prefill or decode
- --pd_master_port#
PD master node port, default is
1212
This parameter needs to be set when run_mode is set to prefill or decode
- --pd_decode_rpyc_port#
Port used by decode nodes for kv move manager rpyc server in PD mode, default is
42000
- --config_server_host#
Host address in configuration server mode
- --config_server_port#
Port number in configuration server mode
Model Configuration Parameters#
- --model_name#
Model name, used to distinguish internal model names, default is
default_model_name
Can be obtained via
host:port/get_model_name
- --model_dir#
Model weight directory path, the application will load configuration, weights, and tokenizer from this directory
- --tokenizer_mode#
Tokenizer loading mode, optional values:
slow: Slow mode, loads fast but runs slow, suitable for debugging and testing
fast: Fast mode (default), achieves best performance
auto: Auto mode, tries to use fast mode, falls back to slow mode if it fails
- --load_way#
Model weight loading method, default is
HF (Huggingface format)
Llama models also support
DS(Deepspeed) format
- --trust_remote_code#
Whether to allow using custom model definition files on Hub
Memory and Batch Processing Parameters#
- --max_total_token_num#
Total token number of kv cache.
If not specified, will be automatically calculated based on mem_fraction
- --mem_fraction#
Memory usage ratio, default is
0.9
If OOM occurs during runtime, you can specify a smaller value
- --batch_max_tokens#
Maximum token count for new batches, controls prefill batch size to prevent OOM
- --running_max_req_size#
Maximum number of requests for simultaneous forward inference, default is
1000
- --max_req_total_len#
Maximum value of request input length + request output length, default is
16384
- --eos_id#
End stop token ID, can specify multiple values. If None, will be loaded from config.json
- --tool_call_parser#
OpenAI interface tool call parser type, optional values:
qwen25
llama3
mistral
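To build intuition for how max_total_token_num relates to mem_fraction, the sketch below gives a rough back-of-the-envelope estimate. It is not LightLLM's actual calculation: the function name, the memory accounting (GPU memory times mem_fraction, minus weights), and the per-token KV size formula are all assumptions made for illustration.

```python
def estimate_max_total_token_num(
    gpu_mem_bytes,
    weight_bytes,
    mem_fraction,
    num_layers,
    num_kv_heads,
    head_dim,
    bytes_per_elem=2,
):
    """Rough estimate of how many tokens fit in the KV cache (illustrative only)."""
    # Assumed accounting: usable memory minus the space taken by model weights.
    budget = gpu_mem_bytes * mem_fraction - weight_bytes
    if budget <= 0:
        raise ValueError("mem_fraction leaves no room for the KV cache")
    # Each token stores one key and one value vector per layer and per KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return int(budget // bytes_per_token)


# Example: an 80 GiB GPU, 16 GiB of fp16 weights, 32 layers, 8 KV heads of size 128.
tokens = estimate_max_total_token_num(
    gpu_mem_bytes=80 * 2**30,
    weight_bytes=16 * 2**30,
    mem_fraction=0.9,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
)
print(tokens)
```

Under this accounting, lowering mem_fraction shrinks the token budget, which is why a smaller value can help when OOM occurs.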
Different Parallel Mode Setting Parameters#
- --nnodes#
Number of nodes, default is
1
- --node_rank#
Current node rank, default is
0
- --multinode_httpmanager_port#
Multi-node HTTP manager port, default is
12345
- --multinode_router_gloo_port#
Multi-node router gloo port, default is
20001
- --tp#
Model tensor parallelism size, default is
1
- --dp#
Data parallelism size, default is
1
This parameter is useful for deepseekv2. When using a deepseekv2 model, set dp equal to the tp parameter. In other cases, do not set it; keep the default value of 1.
- --nccl_host#
nccl_host used to build PyTorch distributed environment, default is
127.0.0.1
For multi-node deployment, should be set to the master node's IP
- --nccl_port#
nccl_port used to build PyTorch distributed environment, default is
28765
- --use_config_server_to_init_nccl#
Use tcp store server started by config_server to initialize nccl, default is False
When set to True, --nccl_host must equal config_server_host, and --nccl_port must be unique on the config_server. Do not use the same nccl_port for different inference nodes; doing so is a serious error.
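The constraints on --nccl_host and --nccl_port when --use_config_server_to_init_nccl is enabled can be checked before launch. The sketch below is a hypothetical pre-flight check, not part of LightLLM; the function name and the node dictionary layout are assumptions.

```python
def check_nccl_settings(config_server_host, nodes):
    """Return a list of problems found in the planned node settings.

    nodes: list of dicts with 'nccl_host' and 'nccl_port' keys (assumed layout).
    """
    errors = []
    seen_ports = set()
    for index, node in enumerate(nodes):
        # Rule 1: nccl_host must equal config_server_host.
        if node["nccl_host"] != config_server_host:
            errors.append(f"node {index}: nccl_host must equal config_server_host")
        # Rule 2: every inference node needs its own nccl_port.
        if node["nccl_port"] in seen_ports:
            errors.append(f"node {index}: nccl_port {node['nccl_port']} is reused")
        seen_ports.add(node["nccl_port"])
    return errors


planned = [
    {"nccl_host": "10.0.0.1", "nccl_port": 28765},
    {"nccl_host": "10.0.0.1", "nccl_port": 28765},  # duplicate port on purpose
]
for problem in check_nccl_settings("10.0.0.1", planned):
    print(problem)
```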
Attention Type Selection Parameters#
- --mode#
Model inference mode, can specify multiple values:
triton_int8kv: Use int8 to store kv cache, can increase token capacity, uses triton kernel
ppl_int8kv: Use int8 to store kv cache, uses ppl fast kernel
ppl_fp16: Use ppl fast fp16 decode attention kernel
triton_flashdecoding: Flashdecoding mode for long context, currently supports llama, llama2, qwen
triton_gqa_attention: Fast kernel for models using GQA
triton_gqa_flashdecoding: Fast flashdecoding kernel for models using GQA
triton_fp8kv: Use float8 to store kv cache, currently only used for deepseek2
Consult the source code to confirm which modes each model supports
Scheduling Parameters#
- --router_token_ratio#
Threshold for determining if the service is busy, default is
0.0. Once the kv cache usage exceeds this value, it will directly switch to conservative scheduling.
- --router_max_new_token_len#
The request output length used by the scheduler when evaluating request kv usage, default is
1024, generally lower than the max_new_tokens set by the user. This parameter only takes effect when --router_token_ratio is greater than 0. Setting this parameter makes request scheduling more aggressive, allowing the system to process more requests simultaneously, but it will inevitably cause requests to be paused and recomputed.
- --router_max_wait_tokens#
Trigger scheduling of new requests every router_max_wait_tokens decoding steps, default is
6
- --disable_aggressive_schedule#
Disable aggressive scheduling
Aggressive scheduling may cause frequent prefill interruptions during decoding. Disabling it can make the router_max_wait_tokens parameter work more effectively.
- --disable_dynamic_prompt_cache#
Disable kv cache caching
- --chunked_prefill_size#
Chunked prefill size, default is
4096
- --disable_chunked_prefill#
Whether to disable chunked prefill
- --diverse_mode#
Multi-result output mode
- --schedule_time_interval#
Schedule time interval, default is
0.03, unit is seconds
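The interaction between --router_token_ratio and --router_max_new_token_len described above can be illustrated as follows. This is a simplified sketch under stated assumptions, not the scheduler's actual code: the function name is hypothetical, and it assumes that conservative scheduling reserves the full max_new_tokens while aggressive scheduling caps the estimate at router_max_new_token_len.

```python
def estimated_request_tokens(
    input_len,
    max_new_tokens,
    kv_usage_ratio,
    router_token_ratio=0.0,
    router_max_new_token_len=1024,
):
    """Estimate the kv cache tokens a request may need (illustrative only)."""
    # With the default ratio of 0.0 the service always counts as busy,
    # so router_max_new_token_len has no effect.
    busy = router_token_ratio <= 0 or kv_usage_ratio > router_token_ratio
    if busy:
        # Conservative: reserve room for the full requested output length.
        output_len = max_new_tokens
    else:
        # Aggressive: assume the output is shorter than the user's limit.
        output_len = min(max_new_tokens, router_max_new_token_len)
    return input_len + output_len


# Same request, evaluated at low and at high kv cache usage.
print(estimated_request_tokens(512, 4096, kv_usage_ratio=0.3, router_token_ratio=0.8))
print(estimated_request_tokens(512, 4096, kv_usage_ratio=0.9, router_token_ratio=0.8))
```

The smaller estimate lets more requests be admitted at once, at the cost of pausing and recomputing a request when its real output runs longer than assumed.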
Output Constraint Parameters#
- --token_healing_mode#
- --output_constraint_mode#
Set the output constraint backend, optional values:
outlines: Use outlines backend
xgrammar: Use xgrammar backend
none: No output constraint (default)
- --first_token_constraint_mode#
Constrain the allowed range of the first token. Use the environment variable FIRST_ALLOWED_TOKENS to set the range, e.g., FIRST_ALLOWED_TOKENS=1,2
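The FIRST_ALLOWED_TOKENS value is a comma-separated list of token IDs. The sketch below shows one way such a value could be parsed; the function name is hypothetical and the server's own parsing may differ.

```python
import os


def parse_first_allowed_tokens(env=os.environ):
    """Parse FIRST_ALLOWED_TOKENS (e.g. "1,2") into a list of token IDs."""
    raw = env.get("FIRST_ALLOWED_TOKENS", "")
    # Skip empty pieces so an unset or empty variable yields an empty list.
    return [int(piece) for piece in raw.split(",") if piece.strip()]


# A plain dict stands in for the process environment in this example.
print(parse_first_allowed_tokens({"FIRST_ALLOWED_TOKENS": "1,2"}))
```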
Multimodal Parameters#
- --enable_multimodal#
Whether to allow loading additional visual models
- --enable_multimodal_audio#
Whether to allow loading additional audio models (requires --enable_multimodal)
- --enable_mps#
Whether to enable nvidia mps for multimodal services
- --cache_capacity#
Cache server capacity for multimodal resources, default is
200
- --visual_infer_batch_size#
Number of images processed in each inference batch, default is
1
- --visual_gpu_ids#
List of GPU IDs to use, e.g., 0 1 2
- --visual_tp#
Number of tensor parallel instances for ViT, default is
1
- --visual_dp#
Number of data parallel instances for ViT, default is
1
- --visual_nccl_ports#
List of NCCL ports for ViT, e.g., 29500 29501 29502, default is [29500]
Performance Optimization Parameters#
- --disable_custom_allreduce#
Whether to disable custom allreduce
- --enable_custom_allgather#
Whether to enable custom allgather
- --enable_tpsp_mix_mode#
The inference backend will use TP SP mixed running mode
Currently only supports llama and deepseek series models
- --enable_prefill_microbatch_overlap#
The inference backend will use microbatch overlap mode for prefill
Currently only supports deepseek series models
- --enable_decode_microbatch_overlap#
The inference backend will use microbatch overlap mode for decoding
- --enable_flashinfer_prefill#
The inference backend will use flashinfer’s attention kernel for prefill
- --enable_flashinfer_decode#
The inference backend will use flashinfer’s attention kernel for decoding
- --enable_fa3#
The inference backend will use fa3 attention kernel for prefill and decoding
- --disable_cudagraph#
Disable cudagraph in the decoding phase
- --graph_max_batch_size#
Maximum batch size that can be captured by cuda graph in the decoding phase, default is
256
- --graph_split_batch_size#
Controls the interval for generating CUDA graphs during decoding, default is
32
For values from 1 to the specified graph_split_batch_size, CUDA graphs will be generated continuously. For values from graph_split_batch_size to graph_max_batch_size, a new CUDA graph will be generated for every increase of graph_grow_step_size. Properly configuring this parameter can help optimize the performance of CUDA graph execution.
- --graph_grow_step_size#
For batch_size values from graph_split_batch_size to graph_max_batch_size, a new CUDA graph will be generated for every increase of graph_grow_step_size, default is
16
- --graph_max_len_in_batch#
Maximum sequence length that can be captured by cuda graph in the decoding phase, default is
max_req_total_len
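The way graph_split_batch_size, graph_grow_step_size, and graph_max_batch_size combine can be made concrete by listing the batch sizes that would get a CUDA graph. This is a sketch of the rule as described above, not the backend's code; the function name is hypothetical, and whether graph_max_batch_size itself is always captured is an assumption.

```python
def cuda_graph_batch_sizes(
    graph_max_batch_size=256,
    graph_split_batch_size=32,
    graph_grow_step_size=16,
):
    """List the decode batch sizes that would get a CUDA graph (illustrative)."""
    split = min(graph_split_batch_size, graph_max_batch_size)
    # Every batch size from 1 up to the split point gets its own graph.
    sizes = list(range(1, split + 1))
    # Beyond the split point, one graph per graph_grow_step_size increase.
    next_size = split + graph_grow_step_size
    while next_size <= graph_max_batch_size:
        sizes.append(next_size)
        next_size += graph_grow_step_size
    # Assumption: the maximum batch size is always captured.
    if sizes[-1] != graph_max_batch_size:
        sizes.append(graph_max_batch_size)
    return sizes


sizes = cuda_graph_batch_sizes()
print(len(sizes), sizes[30:36], sizes[-1])
```

A smaller graph_grow_step_size yields more graphs and a closer fit to the actual batch size, while a larger one reduces the number of graphs to capture.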
Quantization Parameters#
- --quant_type#
Quantization method, optional values:
ppl-w4a16-128
flashllm-w6a16
ao-int4wo-[32,64,128,256]
ao-int8wo
ao-fp8w8a16
ao-fp6w6a16
vllm-w8a8
vllm-fp8w8a8
vllm-fp8w8a8-b128
triton-fp8w8a8-block128
none (default)
- --quant_cfg#
Path to quantization configuration file. Can be used for mixed quantization.
Examples can be found in test/advanced_config/mixed_quantization/llamacls-mix-down.yaml.
- --vit_quant_type#
ViT quantization method, optional values:
ppl-w4a16-128
flashllm-w6a16
ao-int4wo-[32,64,128,256]
ao-int8wo
ao-fp8w8a16
ao-fp6w6a16
vllm-w8a8
vllm-fp8w8a8
none (default)
- --vit_quant_cfg#
Path to ViT quantization configuration file. Can be used for mixed quantization.
Examples can be found in lightllm/common/quantization/configs.
Sampling and Generation Parameters#
- --sampling_backend#
Implementation used for sampling, optional values:
triton: Use torch and triton kernel (default)
sglang_kernel: Use sglang_kernel implementation
- --return_all_prompt_logprobs#
Return logprobs for all prompt tokens
- --use_reward_model#
Use reward model
- --long_truncation_mode#
How to handle when input_token_len + max_new_tokens > max_req_total_len, optional values:
None: Throw exception (default)
head: Remove some head tokens to make input_token_len + max_new_tokens <= max_req_total_len
center: Remove some tokens at the center position to make input_token_len + max_new_tokens <= max_req_total_len
- --use_tgi_api#
Use tgi input and output format
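The behavior of --long_truncation_mode can be illustrated with a small function. This is a minimal sketch, not the server's implementation: the function name is hypothetical, and exactly which tokens are dropped in each mode (for center, keeping equal parts from both ends) is an assumption.

```python
def truncate_prompt(token_ids, max_new_tokens, max_req_total_len, mode=None):
    """Shorten token_ids so that len(token_ids) + max_new_tokens fits the limit."""
    allowed = max_req_total_len - max_new_tokens
    if len(token_ids) <= allowed:
        return list(token_ids)
    if mode is None or allowed <= 0:
        # Default behavior: refuse requests that are too long.
        raise ValueError("input_token_len + max_new_tokens > max_req_total_len")
    if mode == "head":
        # Drop tokens from the start, keep the most recent ones.
        return list(token_ids[-allowed:])
    if mode == "center":
        # Keep the beginning and the end, drop tokens from the middle.
        keep_front = allowed // 2
        keep_back = allowed - keep_front
        return list(token_ids[:keep_front]) + list(token_ids[-keep_back:])
    raise ValueError(f"unknown mode: {mode!r}")


prompt = list(range(10))
print(truncate_prompt(prompt, max_new_tokens=4, max_req_total_len=10, mode="head"))
print(truncate_prompt(prompt, max_new_tokens=4, max_req_total_len=10, mode="center"))
```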
MTP Multi-Prediction Parameters#
- --mtp_mode#
Supported mtp modes, optional values:
deepseekv3
None: Do not enable mtp (default)
- --mtp_draft_model_dir#
Path to the draft model for MTP multi-prediction functionality
Used to load the MTP multi-output token model.
- --mtp_step#
Specify the number of additional tokens predicted by the draft model, default is
0
Currently this feature only supports DeepSeekV3/R1 models. Increasing this value allows more predictions, but ensure the model is compatible with the specified number of steps. Currently DeepSeekV3/R1 models only support 1 step.
DeepSeek Redundant Expert Parameters#
- --ep_redundancy_expert_config_path#
Path to redundant expert configuration. Can be used for deepseekv3 models.
- --auto_update_redundancy_expert#
Whether to update redundant experts for deepseekv3 models through online expert usage counters.
Monitoring and Logging Parameters#
- --disable_log_stats#
Disable throughput statistics logging
- --log_stats_interval#
Interval for recording statistics (seconds), default is
10
- --health_monitor#
Check service health status and restart on error
- --metric_gateway#
Address for collecting monitoring metrics
- --job_name#
Job name for monitoring, default is
lightllm
- --grouping_key#
Grouping key for monitoring, format is key=value, can specify multiple
- --push_interval#
Interval for pushing monitoring metrics (seconds), default is
10
- --enable_monitor_auth#
Whether to enable authentication for push_gateway
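The key=value format accepted by --grouping_key can be turned into a dictionary of labels. The sketch below is illustrative only; the function name is hypothetical and the example keys are made up.

```python
def parse_grouping_keys(items):
    """Convert ["key=value", ...] into a dict, rejecting malformed entries."""
    result = {}
    for item in items:
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        result[key] = value
    return result


# Example keys are invented for illustration.
print(parse_grouping_keys(["cluster=a", "instance=node0"]))
```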