APIServer Parameter Details#

This document provides detailed information about all startup parameters and their usage for LightLLM APIServer.

Basic Configuration Parameters#

--run_mode#

Set the running mode, optional values:

  • normal: Single server mode (default)

  • prefill: Prefill mode (for pd disaggregation running mode)

  • decode: Decode mode (for pd disaggregation running mode)

  • pd_master: pd master node mode (for pd disaggregation running mode)

  • config_server: Configuration server mode (for pd disaggregation mode). It registers pd_master nodes and serves the pd_master node list. It is designed for large-scale, high-concurrency deployments in which pd_master hits significant CPU bottlenecks.

--host#

Server listening address, default is 127.0.0.1

--port#

Server listening port, default is 8000

--httpserver_workers#

HTTP server worker process count, default is 1

--zmq_mode#

ZMQ communication mode, optional values:

  • tcp://: TCP mode

  • ipc:///tmp/: IPC mode (default)

Only these two values are accepted: 'tcp://' and 'ipc:///tmp/'
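The restricted value sets for --run_mode and --zmq_mode map naturally onto argparse choices. The following is a minimal sketch, assuming a stand-alone parser; build_parser is a hypothetical helper and not LightLLM's actual parser code. It uses only the flags and defaults listed above:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser; flag names, choices, and defaults come from the text above.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--run_mode",
        type=str,
        choices=["normal", "prefill", "decode", "pd_master", "config_server"],
        default="normal",
    )
    parser.add_argument("--host", type=str, default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--httpserver_workers", type=int, default=1)
    parser.add_argument(
        "--zmq_mode",
        type=str,
        choices=["tcp://", "ipc:///tmp/"],
        default="ipc:///tmp/",
    )
    return parser


args = build_parser().parse_args(["--run_mode", "prefill", "--zmq_mode", "tcp://"])
print(args.run_mode, args.zmq_mode, args.port)
```

With choices in place, a value outside the listed set is rejected at parse time rather than at server startup.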

PD disaggregation Mode Parameters#

--pd_master_ip#

PD master node IP address, default is 0.0.0.0

This parameter needs to be set when run_mode is set to prefill or decode

--pd_master_port#

PD master node port, default is 1212

This parameter needs to be set when run_mode is set to prefill or decode
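As an illustration of the rule above, the sketch below assembles the flags that point a prefill or decode node at its pd_master. The helper name pd_node_flags and the address 10.0.0.5 are hypothetical; only the flag names and the default port come from this document:

```python
from typing import List


def pd_node_flags(run_mode: str, pd_master_ip: str, pd_master_port: int = 1212) -> List[str]:
    """Hypothetical helper: flags that point a prefill/decode node at its pd_master."""
    if run_mode not in ("prefill", "decode"):
        raise ValueError("pd_master address flags only apply to prefill or decode nodes")
    return [
        "--run_mode", run_mode,
        "--pd_master_ip", pd_master_ip,
        "--pd_master_port", str(pd_master_port),
    ]


# 10.0.0.5 is a placeholder address used only for this example.
print(" ".join(pd_node_flags("prefill", "10.0.0.5")))
```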

--pd_decode_rpyc_port#

Port used by decode nodes for kv move manager rpyc server in PD mode, default is 42000

--config_server_host#

Host address in configuration server mode

--config_server_port#

Port number in configuration server mode

Model Configuration Parameters#

--model_name#

Model name, used to distinguish internal model names, default is default_model_name

Can be obtained via host:port/get_model_name

--model_dir#

Model weight directory path, the application will load configuration, weights, and tokenizer from this directory

--tokenizer_mode#

Tokenizer loading mode, optional values:

  • slow: Slow mode, loads fast but runs slow, suitable for debugging and testing

  • fast: Fast mode (default), achieves best performance

  • auto: Auto mode, tries to use fast mode, falls back to slow mode if it fails

--load_way#

Model weight loading method, default is HF (Huggingface format)

Llama models also support DS (Deepspeed) format

--trust_remote_code#

Whether to allow using custom model definition files on Hub

Memory and Batch Processing Parameters#

--max_total_token_num#

Total token number of kv cache.

If not specified, will be automatically calculated based on mem_fraction

--mem_fraction#

Memory usage ratio, default is 0.9

If OOM occurs during runtime, you can specify a smaller value
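To build intuition for how mem_fraction bounds the kv cache, here is a back-of-envelope estimate. It is a sketch under stated assumptions, not LightLLM's actual calculation: the function name, the model shape, and the memory figures are all hypothetical, and it ignores activations and other runtime overhead.

```python
def estimate_max_total_token_num(
    gpu_mem_bytes: int,
    weight_bytes: int,
    mem_fraction: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,
) -> int:
    # Each token stores one key and one value vector per layer and per kv head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    # Assumption: whatever the memory budget leaves after weights goes to the kv cache.
    budget = gpu_mem_bytes * mem_fraction - weight_bytes
    return max(int(budget // bytes_per_token), 0)


GIB = 1024 ** 3
# Hypothetical setup: 80 GiB GPU, 16 GiB of weights, 32 layers, 8 kv heads, head_dim 128, fp16.
for fraction in (0.9, 0.8):
    tokens = estimate_max_total_token_num(80 * GIB, 16 * GIB, fraction, 32, 8, 128)
    print(fraction, tokens)
```

Under this model, lowering mem_fraction directly shrinks the token capacity, which is why a smaller value leaves more headroom when OOM occurs.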

--batch_max_tokens#

Maximum token count for new batches, controls prefill batch size to prevent OOM

--running_max_req_size#

Maximum number of requests for simultaneous forward inference, default is 1000

--max_req_total_len#

Maximum value of request input length + request output length, default is 16384

--eos_id#

End (stop) token ID; multiple values can be specified. If None, it is loaded from config.json

--tool_call_parser#

OpenAI interface tool call parser type, optional values:

  • qwen25

  • llama3

  • mistral

Different Parallel Mode Setting Parameters#

--nnodes#

Number of nodes, default is 1

--node_rank#

Current node rank, default is 0

--multinode_httpmanager_port#

Multi-node HTTP manager port, default is 12345

--multinode_router_gloo_port#

Multi-node router gloo port, default is 20001

--tp#

Model tensor parallelism size, default is 1

--dp#

Data parallelism size, default is 1

This is a useful parameter for deepseekv2. When using deepseekv2 model, set dp equal to the tp parameter. In other cases, please do not set it, keep the default value of 1.

--nccl_host#

nccl_host used to build PyTorch distributed environment, default is 127.0.0.1

For multi-node deployments, set this to the master node's IP

--nccl_port#

nccl_port used to build PyTorch distributed environment, default is 28765

--use_config_server_to_init_nccl#

Use tcp store server started by config_server to initialize nccl, default is False

When set to True, --nccl_host must equal config_server_host, and --nccl_port must be unique for the config_server. Do not use the same nccl_port for different inference nodes; doing so is a serious error.
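The two constraints above can be checked before launch. This is a minimal sketch of such a pre-flight check; NodeArgs and check_config_server_nccl are hypothetical names, and the hosts and ports in the example are placeholders:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeArgs:
    name: str        # hypothetical label for an inference node
    nccl_host: str
    nccl_port: int


def check_config_server_nccl(config_server_host: str, nodes: List[NodeArgs]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the settings look consistent."""
    problems: List[str] = []
    seen_ports: Dict[int, str] = {}
    for node in nodes:
        # Constraint 1: nccl_host must equal config_server_host.
        if node.nccl_host != config_server_host:
            problems.append(f"{node.name}: nccl_host {node.nccl_host} != {config_server_host}")
        # Constraint 2: no two inference nodes may share an nccl_port.
        if node.nccl_port in seen_ports:
            problems.append(f"{node.name}: nccl_port {node.nccl_port} already used by {seen_ports[node.nccl_port]}")
        else:
            seen_ports[node.nccl_port] = node.name
    return problems


nodes = [NodeArgs("prefill-0", "10.0.0.1", 28765), NodeArgs("decode-0", "10.0.0.1", 28765)]
for problem in check_config_server_nccl("10.0.0.1", nodes):
    print(problem)
```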

Attention Type Selection Parameters#

--mode#

Model inference mode, can specify multiple values:

  • triton_int8kv: Use int8 to store kv cache, can increase token capacity, uses triton kernel

  • ppl_int8kv: Use int8 to store kv cache, uses ppl fast kernel

  • ppl_fp16: Use ppl fast fp16 decode attention kernel

  • triton_flashdecoding: Flashdecoding mode for long context, currently supports llama, llama2, and qwen

  • triton_gqa_attention: Fast kernel for models using GQA

  • triton_gqa_flashdecoding: Fast flashdecoding kernel for models using GQA

  • triton_fp8kv: Use float8 to store kv cache, currently only used for deepseek2

Check the source code to confirm which modes each model supports

Scheduling Parameters#

--router_token_ratio#

Threshold for determining if the service is busy, default is 0.0. Once the kv cache usage exceeds this value, it will directly switch to conservative scheduling.

--router_max_new_token_len#

The request output length used by the scheduler when evaluating request kv usage, default is 1024, generally lower than the max_new_tokens set by the user. This parameter only takes effect when --router_token_ratio is greater than 0. Setting this parameter makes request scheduling more aggressive, allowing the system to process more requests simultaneously, but it will inevitably cause some requests to be paused and recalculated.
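The interaction between --router_token_ratio and --router_max_new_token_len can be pictured with a simplified model. This is a sketch only: the function name is hypothetical, and taking the minimum of the two lengths is an assumption made for illustration; the real scheduler may estimate output length differently.

```python
def estimated_output_len(
    max_new_tokens: int,
    kv_usage_ratio: float,
    router_token_ratio: float = 0.0,
    router_max_new_token_len: int = 1024,
) -> int:
    # The service counts as busy once kv cache usage exceeds router_token_ratio.
    is_busy = kv_usage_ratio > router_token_ratio
    if is_busy:
        # Conservative scheduling: reserve room for the full user-requested length.
        return max_new_tokens
    # Aggressive scheduling (assumed form): budget for a shorter output than requested.
    return min(max_new_tokens, router_max_new_token_len)


# With the default ratio of 0.0, any kv usage above zero counts as busy.
print(estimated_output_len(4096, kv_usage_ratio=0.3))
# With a ratio of 0.5, the same load is scheduled aggressively.
print(estimated_output_len(4096, kv_usage_ratio=0.3, router_token_ratio=0.5))
```

In this model, a smaller estimate lets more requests fit into the kv budget at once, at the cost of pausing and recomputing requests that outgrow their estimate.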

--router_max_wait_tokens#

Trigger scheduling of new requests every router_max_wait_tokens decoding steps, default is 6

--disable_aggressive_schedule#

Disable aggressive scheduling

Aggressive scheduling may cause frequent prefill interruptions during decoding. Disabling it can make the router_max_wait_tokens parameter work more effectively.

--disable_dynamic_prompt_cache#

Disable kv cache caching

--chunked_prefill_size#

Chunked prefill size, default is 4096

--disable_chunked_prefill#

Whether to disable chunked prefill

--diverse_mode#

Multi-result output mode

--schedule_time_interval#

Schedule time interval, default is 0.03, unit is seconds

Output Constraint Parameters#

--token_healing_mode#

--output_constraint_mode#

Set the output constraint backend, optional values:

  • outlines: Use outlines backend

  • xgrammar: Use xgrammar backend

  • none: No output constraint (default)

--first_token_constraint_mode#

Constrain the allowed range of the first token. Use the environment variable FIRST_ALLOWED_TOKENS to set the range, e.g., FIRST_ALLOWED_TOKENS=1,2
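The comma-separated format of FIRST_ALLOWED_TOKENS can be parsed as shown below. This is an illustrative sketch; read_first_allowed_tokens is a hypothetical helper, not the function LightLLM uses internally:

```python
import os
from typing import List, Mapping


def read_first_allowed_tokens(env: Mapping[str, str] = os.environ) -> List[int]:
    """Parse FIRST_ALLOWED_TOKENS (e.g. "1,2") into a list of token IDs."""
    raw = env.get("FIRST_ALLOWED_TOKENS", "")
    # Skip empty fragments so an unset variable or a trailing comma yields no spurious entries.
    return [int(part) for part in raw.split(",") if part.strip()]


# A plain dict stands in for the process environment in this example.
print(read_first_allowed_tokens({"FIRST_ALLOWED_TOKENS": "1,2"}))
```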

Multimodal Parameters#

--enable_multimodal#

Whether to allow loading additional visual models

--enable_multimodal_audio#

Whether to allow loading additional audio models (requires --enable_multimodal)

--enable_mps#

Whether to enable nvidia mps for multimodal services

--cache_capacity#

Cache server capacity for multimodal resources, default is 200

--visual_infer_batch_size#

Number of images processed in each inference batch, default is 1

--visual_gpu_ids#

List of GPU IDs to use, e.g., 0 1 2

--visual_tp#

Number of tensor parallel instances for ViT, default is 1

--visual_dp#

Number of data parallel instances for ViT, default is 1

--visual_nccl_ports#

List of NCCL ports for ViT, e.g., 29500 29501 29502, default is [29500]

Performance Optimization Parameters#

--disable_custom_allreduce#

Whether to disable custom allreduce

--enable_custom_allgather#

Whether to enable custom allgather

--enable_tpsp_mix_mode#

The inference backend will use TP SP mixed running mode

Currently only supports llama and deepseek series models

--enable_prefill_microbatch_overlap#

The inference backend will use microbatch overlap mode for prefill

Currently only supports deepseek series models

--enable_decode_microbatch_overlap#

The inference backend will use microbatch overlap mode for decoding

--enable_flashinfer_prefill#

The inference backend will use flashinfer’s attention kernel for prefill

--enable_flashinfer_decode#

The inference backend will use flashinfer’s attention kernel for decoding

--enable_fa3#

The inference backend will use fa3 attention kernel for prefill and decoding

--disable_cudagraph#

Disable cudagraph in the decoding phase

--graph_max_batch_size#

Maximum batch size that can be captured by cuda graph in the decoding phase, default is 256

--graph_split_batch_size#

Controls the interval for generating CUDA graphs during decoding, default is 32

For values from 1 to the specified graph_split_batch_size, CUDA graphs will be generated continuously. For values from graph_split_batch_size to graph_max_batch_size, a new CUDA graph will be generated for every increase of graph_grow_step_size. Properly configuring this parameter can help optimize the performance of CUDA graph execution.

--graph_grow_step_size#

For batch_size values from graph_split_batch_size to graph_max_batch_size, a new CUDA graph will be generated for every increase of graph_grow_step_size, default is 16
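The capture schedule described by graph_split_batch_size, graph_grow_step_size, and graph_max_batch_size can be enumerated as follows. This is a sketch of the rule as stated above; the function name is hypothetical, and always including graph_max_batch_size as the final entry is an assumption:

```python
from typing import List


def cuda_graph_batch_sizes(
    graph_max_batch_size: int = 256,
    graph_split_batch_size: int = 32,
    graph_grow_step_size: int = 16,
) -> List[int]:
    split = min(graph_split_batch_size, graph_max_batch_size)
    # Every batch size from 1 up to the split point gets its own graph.
    sizes = list(range(1, split + 1))
    # Beyond the split point, one graph is captured per growth step.
    sizes += list(range(split + graph_grow_step_size, graph_max_batch_size + 1, graph_grow_step_size))
    # Assumption: the maximum batch size is captured even when it is off the step grid.
    if sizes and sizes[-1] != graph_max_batch_size:
        sizes.append(graph_max_batch_size)
    return sizes


sizes = cuda_graph_batch_sizes()
print(len(sizes), sizes[30:36], sizes[-1])
```

Under these assumptions the defaults give a dense run of graphs for small batches and a sparse one for large batches, which keeps capture time and memory bounded.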

--graph_max_len_in_batch#

Maximum sequence length that can be captured by cuda graph in the decoding phase, default is max_req_total_len

Quantization Parameters#

--quant_type#

Quantization method, optional values:

  • ppl-w4a16-128

  • flashllm-w6a16

  • ao-int4wo-[32,64,128,256]

  • ao-int8wo

  • ao-fp8w8a16

  • ao-fp6w6a16

  • vllm-w8a8

  • vllm-fp8w8a8

  • vllm-fp8w8a8-b128

  • triton-fp8w8a8-block128

  • none (default)

--quant_cfg#

Path to quantization configuration file. Can be used for mixed quantization.

Examples can be found in test/advanced_config/mixed_quantization/llamacls-mix-down.yaml.

--vit_quant_type#

ViT quantization method, optional values:

  • ppl-w4a16-128

  • flashllm-w6a16

  • ao-int4wo-[32,64,128,256]

  • ao-int8wo

  • ao-fp8w8a16

  • ao-fp6w6a16

  • vllm-w8a8

  • vllm-fp8w8a8

  • none (default)

--vit_quant_cfg#

Path to ViT quantization configuration file. Can be used for mixed quantization.

Examples can be found in lightllm/common/quantization/configs.

Sampling and Generation Parameters#

--sampling_backend#

Implementation used for sampling, optional values:

  • triton: Use torch and triton kernel (default)

  • sglang_kernel: Use sglang_kernel implementation

--return_all_prompt_logprobs#

Return logprobs for all prompt tokens

--use_reward_model#

Use reward model

--long_truncation_mode#

How to handle when input_token_len + max_new_tokens > max_req_total_len, optional values:

  • None: Throw exception (default)

  • head: Remove some head tokens to make input_token_len + max_new_tokens <= max_req_total_len

  • center: Remove some tokens at the center position to make input_token_len + max_new_tokens <= max_req_total_len
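The head and center modes can be sketched as below. This is a minimal illustration, assuming that exactly the excess tokens are removed and that center keeps equal-sized halves from both ends; truncate_prompt is a hypothetical helper, and the positions LightLLM actually removes may differ:

```python
from typing import List, Optional


def truncate_prompt(
    token_ids: List[int],
    max_new_tokens: int,
    max_req_total_len: int,
    mode: Optional[str] = None,
) -> List[int]:
    excess = len(token_ids) + max_new_tokens - max_req_total_len
    if excess <= 0:
        return list(token_ids)  # already within the limit
    if mode is None:
        raise ValueError("input_token_len + max_new_tokens > max_req_total_len")
    if excess >= len(token_ids):
        raise ValueError("max_new_tokens leaves no room for any prompt tokens")
    if mode == "head":
        # Drop the first `excess` tokens.
        return list(token_ids[excess:])
    if mode == "center":
        # Keep the two ends and drop `excess` tokens from the middle.
        keep = len(token_ids) - excess
        left = keep // 2
        right = keep - left
        return list(token_ids[:left]) + list(token_ids[len(token_ids) - right:])
    raise ValueError(f"unknown long_truncation_mode: {mode}")


prompt = list(range(10))
print(truncate_prompt(prompt, max_new_tokens=4, max_req_total_len=10, mode="head"))
print(truncate_prompt(prompt, max_new_tokens=4, max_req_total_len=10, mode="center"))
```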

--use_tgi_api#

Use tgi input and output format

MTP Multi-Prediction Parameters#

--mtp_mode#

Supported mtp modes, optional values:

  • deepseekv3

  • None: Do not enable mtp (default)

--mtp_draft_model_dir#

Path to the draft model for MTP multi-prediction functionality

Used to load the MTP multi-output token model.

--mtp_step#

Specify the number of additional tokens predicted by the draft model, default is 0

This feature currently supports only DeepSeekV3/R1 models. Increasing this value allows more predictions, but make sure the model is compatible with the specified number of steps; DeepSeekV3/R1 models currently support only 1 step.

DeepSeek Redundant Expert Parameters#

--ep_redundancy_expert_config_path#

Path to redundant expert configuration. Can be used for deepseekv3 models.

--auto_update_redundancy_expert#

Whether to update redundant experts for deepseekv3 models through online expert usage counters.

Monitoring and Logging Parameters#

--disable_log_stats#

Disable throughput statistics logging

--log_stats_interval#

Interval for recording statistics (seconds), default is 10

--health_monitor#

Check service health status and restart on error

--metric_gateway#

Address for collecting monitoring metrics

--job_name#

Job name for monitoring, default is lightllm

--grouping_key#

Grouping key for monitoring, format is key=value, can specify multiple
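The key=value format can be turned into a mapping as shown below. This is an illustrative sketch; parse_grouping_keys is a hypothetical helper, and the example keys and values are placeholders:

```python
from typing import Dict, List


def parse_grouping_keys(items: List[str]) -> Dict[str, str]:
    """Convert ["key=value", ...] into a dict, rejecting malformed entries."""
    result: Dict[str, str] = {}
    for item in items:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        result[key] = value
    return result


print(parse_grouping_keys(["env=prod", "region=us"]))
```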

--push_interval#

Interval for pushing monitoring metrics (seconds), default is 10

--enable_monitor_auth#

Whether to enable authentication for push_gateway