APIServer Parameter Details#

This document provides detailed information about all startup parameters and their usage for LightLLM APIServer.

Basic Configuration Parameters#

--run_mode#

Set the running mode, optional values:

normal: Single server mode (default)
prefill: Prefill mode (for pd disaggregation running mode)
decode: Decode mode (for pd disaggregation running mode)
pd_master: pd master node mode (for pd disaggregation running mode)
config_server: Configuration server mode (for pd disaggregation mode, used to register pd_master nodes and get pd_master node list), specifically designed for large-scale, high-concurrency scenarios, used when pd_master encounters significant CPU bottlenecks.

--host#: Server listening address, default is 127.0.0.1

--port#: Server listening port, default is 8000

--httpserver_workers#: HTTP server worker process count, default is 1

--zmq_mode#

ZMQ communication mode, optional values:

tcp://: TCP mode
ipc:///tmp/: IPC mode (default)

Can only choose from ['tcp://', 'ipc:///tmp/']

PD disaggregation Mode Parameters#

--pd_master_ip#

PD master node IP address, default is 0.0.0.0

This parameter needs to be set when run_mode is set to prefill or decode

--pd_master_port#

PD master node port, default is 1212

This parameter needs to be set when run_mode is set to prefill or decode

--pd_decode_rpyc_port#: Port used by decode nodes for kv move manager rpyc server in PD mode, default is 42000

--config_server_host#: Host address in configuration server mode

--config_server_port#: Port number in configuration server mode

Model Configuration Parameters#

--model_name#

Model name, used to distinguish internal model names, default is default_model_name

Can be obtained via host:port/get_model_name

--model_dir#: Model weight directory path, the application will load configuration, weights, and tokenizer from this directory

--tokenizer_mode#

Tokenizer loading mode, optional values:

slow: Slow mode, loads fast but runs slow, suitable for debugging and testing
fast: Fast mode (default), achieves best performance
auto: Auto mode, tries to use fast mode, falls back to slow mode if it fails

--load_way#

Model weight loading method, default is HF (Huggingface format)

Llama models also support DS (Deepspeed) format

--trust_remote_code#: Whether to allow using custom model definition files on Hub

Memory and Batch Processing Parameters#

--max_total_token_num#

Total token number of kv cache.

If not specified, will be automatically calculated based on mem_fraction

--mem_fraction#

Memory usage ratio, default is 0.9

If OOM occurs during runtime, you can specify a smaller value

--batch_max_tokens#: Maximum token count for new batches, controls prefill batch size to prevent OOM

--running_max_req_size#: Maximum number of requests for simultaneous forward inference, default is 1000

--max_req_total_len#: Maximum value of request input length + request output length, default is 16384

--eos_id#: End stop token ID, can specify multiple values. If None, will be loaded from config.json

--tool_call_parser#

OpenAI interface tool call parser type, optional values:

qwen25
llama3
mistral

Different Parallel Mode Setting Parameters#

--nnodes#: Number of nodes, default is 1

--node_rank#: Current node rank, default is 0

--multinode_httpmanager_port#: Multi-node HTTP manager port, default is 12345

--multinode_router_gloo_port#: Multi-node router gloo port, default is 20001

--tp#: Model tensor parallelism size, default is 1

--dp#

Data parallelism size, default is 1

This is a useful parameter for deepseekv2. When using deepseekv2 model, set dp equal to the tp parameter. In other cases, please do not set it, keep the default value of 1.

--nccl_host#

nccl_host used to build PyTorch distributed environment, default is 127.0.0.1

For multi-node deployment, should be set to the master node’s IP

--nccl_port#: nccl_port used to build PyTorch distributed environment, default is 28765

--use_config_server_to_init_nccl#

Use tcp store server started by config_server to initialize nccl, default is False

When set to True, –nccl_host must equal config_server_host, –nccl_port must be unique for config_server, do not use the same nccl_port for different inference nodes, this will be a serious error

Scheduling Parameters#

--router_token_ratio#: Threshold for determining if the service is busy, default is 0.0. Once the kv cache usage exceeds this value, it will directly switch to conservative scheduling.

--router_max_new_token_len#: The request output length used by the scheduler when evaluating request kv usage, default is 1024, generally lower than the max_new_tokens set by the user. This parameter only takes effect when –router_token_ratio is greater than 0. Setting this parameter will make request scheduling more aggressive, allowing the system to process more requests simultaneously, but will inevitably cause request pause and recalculation.

--router_max_wait_tokens#: Trigger scheduling of new requests every router_max_wait_tokens decoding steps, default is 6

--disable_aggressive_schedule#

Disable aggressive scheduling

Aggressive scheduling may cause frequent prefill interruptions during decoding. Disabling it can make the router_max_wait_tokens parameter work more effectively.

--disable_dynamic_prompt_cache#: Disable kv cache caching

--chunked_prefill_size#: Chunked prefill size, default is 4096

--disable_chunked_prefill#: Whether to disable chunked prefill

--diverse_mode#: Multi-result output mode

--schedule_time_interval#: Schedule time interval, default is 0.03, unit is seconds

Output Constraint Parameters#

--token_healing_mode#

--output_constraint_mode#

Set the output constraint backend, optional values:

outlines: Use outlines backend
xgrammar: Use xgrammar backend
none: No output constraint (default)

--first_token_constraint_mode#: Constrain the allowed range of the first token Use environment variable FIRST_ALLOWED_TOKENS to set the range, e.g., FIRST_ALLOWED_TOKENS=1,2

Multimodal Parameters#

--enable_multimodal#: Whether to allow loading additional visual models

--enable_multimodal_audio#: Whether to allow loading additional audio models (requires –enable_multimodal)

--enable_mps#: Whether to enable nvidia mps for multimodal services

--cache_capacity#: Cache server capacity for multimodal resources, default is 200

--visual_infer_batch_size#: Number of images processed in each inference batch, default is 1

--visual_gpu_ids#: List of GPU IDs to use, e.g., 0 1 2

--visual_tp#: Number of tensor parallel instances for ViT, default is 1

--visual_dp#: Number of data parallel instances for ViT, default is 1

--visual_nccl_ports#: List of NCCL ports for ViT, e.g., 29500 29501 29502, default is [29500]

--vit_att_backend#

Set the attention backend for ViT. Available options:

auto: Automatically select the best backend (default), with priority fa3 > xformers > sdpa > triton
fa3: Use Flash-Attention 3 backend
xformers: Use xformers backend
sdpa: Use sdpa backend
triton: Use Triton backend

Performance Optimization Parameters#

--disable_custom_allreduce#: Whether to disable custom allreduce

--enable_custom_allgather#: Whether to enable custom allgather

--enable_tpsp_mix_mode#

The inference backend will use TP SP mixed running mode

Currently only supports llama and deepseek series models

--enable_prefill_microbatch_overlap#

The inference backend will use microbatch overlap mode for prefill

Currently only supports deepseek series.

--enable_decode_microbatch_overlap#: The inference backend will use microbatch overlap mode for decoding

--llm_prefill_att_backend#

Set the attention backend for the prefill phase. Available options:

auto: Automatically select the best backend (default), with priority fa3 > flashinfer > triton
fa3: Use Flash-Attention 3 backend
flashinfer: Use FlashInfer backend
triton: Use Triton backend

--llm_decode_att_backend#

Set the attention backend for the decode phase. Available options:

auto: Automatically select the best backend (default), with priority fa3 > flashinfer > triton
fa3: Use Flash-Attention 3 backend
flashinfer: Use FlashInfer backend
triton: Use Triton backend

--disable_cudagraph#: Disable cudagraph in the decoding phase

--graph_max_batch_size#: Maximum batch size that can be captured by cuda graph in the decoding phase, default is 256

--graph_split_batch_size#

Controls the interval for generating CUDA graphs during decoding, default is 32

For values from 1 to the specified graph_split_batch_size, CUDA graphs will be generated continuously. For values from graph_split_batch_size to graph_max_batch_size, a new CUDA graph will be generated for every increase of graph_grow_step_size. Properly configuring this parameter can help optimize the performance of CUDA graph execution.

--graph_grow_step_size#: For batch_size values from graph_split_batch_size to graph_max_batch_size, a new CUDA graph will be generated for every increase of graph_grow_step_size, default is 16

--graph_max_len_in_batch#: Maximum sequence length that can be captured by cuda graph in the decoding phase, default is max_req_total_len

Quantization Parameters#

--quant_type#

Quantization method, optional values:

vllm-w8a8
vllm-fp8w8a8
vllm-fp8w8a8-b128
deepgemm-fp8w8a8-b128
triton-fp8w8a8-block128
awq
awq_marlin
none (default)

--quant_cfg#

Path to quantization configuration file. Can be used for mixed quantization.

Examples can be found in test/advanced_config/mixed_quantization/llamacls-mix-down.yaml.

--vit_quant_type#

ViT quantization method, optional values:

vllm-w8a8
vllm-fp8w8a8
none (default)

--vit_quant_cfg#

Path to ViT quantization configuration file. Can be used for mixed quantization.

Examples can be found in lightllm/common/quantization/configs.

Sampling and Generation Parameters#

--sampling_backend#

Implementation used for sampling, optional values:

triton: Use torch and triton kernel (default)
sglang_kernel: Use sglang_kernel implementation

--return_all_prompt_logprobs#: Return logprobs for all prompt tokens

--use_reward_model#: Use reward model

--long_truncation_mode#

How to handle when input_token_len + max_new_tokens > max_req_total_len, optional values:

None: Throw exception (default)
head: Remove some head tokens to make input_token_len + max_new_tokens <= max_req_total_len
center: Remove some tokens at the center position to make input_token_len + max_new_tokens <= max_req_total_len

--use_tgi_api#: Use tgi input and output format

MTP Multi-Prediction Parameters#

--mtp_mode#

Supported mtp modes, it is recommended to use eagle_with_att for better performance, optional values:

vanilla_with_att
eagle_with_att
vanilla_no_att
eagle_no_att
None: Do not enable mtp (default)

--mtp_draft_model_dir#

Path to the draft model for MTP multi-prediction functionality

Used to load the MTP multi-output token model.

--mtp_step#

Specify the number of additional tokens predicted by the draft model, default is 0

Currently this feature only supports DeepSeekV3/R1 models. Increasing this value allows more predictions, but ensure the model is compatible with the specified number of steps. Currently deepseekv3/r1 models only support 1 step

DeepSeek Redundant Expert Parameters#

--ep_redundancy_expert_config_path#: Path to redundant expert configuration. Can be used for deepseekv3 models.

--auto_update_redundancy_expert#: Whether to update redundant experts for deepseekv3 models through online expert usage counters.

Monitoring and Logging Parameters#

--disable_log_stats#: Disable throughput statistics logging

--log_stats_interval#: Interval for recording statistics (seconds), default is 10

--health_monitor#: Check service health status and restart on error

--metric_gateway#: Address for collecting monitoring metrics

--job_name#: Job name for monitoring, default is lightllm

--grouping_key#: Grouping key for monitoring, format is key=value, can specify multiple

--push_interval#: Interval for pushing monitoring metrics (seconds), default is 10

--enable_monitor_auth#: Whether to enable authentication for push_gateway

APIServer Parameter Details

Contents