DeepSeek Model Deployment Guide#
LightLLM supports various deployment solutions for DeepSeek models, including DeepSeek-R1, DeepSeek-V2, DeepSeek-V3, etc. This document provides detailed information on various deployment modes and configuration solutions.
Deployment Mode Overview#
LightLLM supports the following deployment modes:
Single node TP Mode: Deploy using tensor parallelism on a single node
Single node EP Mode: Deploy using expert parallelism on a single node
Multi-node TP Mode: Use tensor parallelism across multiple nodes
Multi-node EP Mode: Use expert parallelism across multiple nodes
PD disaggregation Mode: Separate prefill and decode deployment
Multi PD Master Mode: Support multiple PD Master nodes
1. Single node Deployment Solutions#
1.1 Single node TP Mode (Tensor Parallel)#
Suitable for deploying DeepSeek-R1 model on a single H200 node.
Launch Command:
# H200 Single node DeepSeek-R1 TP Mode
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--enable_fa3
Parameter Description: - LOADWORKER=18: Model loading thread count, improves loading speed - –tp 8: Tensor parallelism, using 8 GPUs - –enable_fa3: Enable Flash Attention 3.0 - –port 8088: Service port
1.2 Single node DP + EP Mode (Data Parallel + Expert Parallel)#
Suitable for expert parallelism deployment of MoE models like DeepSeek-V2/V3.
Launch Command:
# H200 Single node DeepSeek-R1 DP + EP Mode
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--dp 8 \
--enable_fa3
Parameter Description: - MOE_MODE=EP: Set expert parallelism mode - –tp 8: Tensor parallelism - –dp 8: Data parallelism, usually set to the same value as tp - –enable_fa3: Enable Flash Attention 3.0
Optional Optimization Parameters: - –enable_prefill_microbatch_overlap: Enable prefill microbatch overlap - –enable_decode_microbatch_overlap: Enable decode microbatch overlap
2. Multi-node Deployment Solutions#
2.1 Multi-node TP Mode#
Suitable for deployment across multiple H200/H100 nodes.
Node 0 Launch Command:
# H200/H100 Multi-node DeepSeek-R1 TP Mode Node 0
# Usage: sh multi_node_tp_node0.sh <nccl_host>
export nccl_host=$1
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--enable_fa3 \
--nnodes 2 \
--node_rank 0 \
--nccl_host $nccl_host \
--nccl_port 2732
Node 1 Launch Command:
# H200/H100 Multi-node DeepSeek-R1 TP Mode Node 1
# Usage: sh multi_node_tp_node1.sh <nccl_host>
export nccl_host=$1
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--enable_fa3 \
--nnodes 2 \
--node_rank 1 \
--nccl_host $nccl_host \
--nccl_port 2732
Parameter Description: - –nnodes 2: Total number of nodes - –node_rank 0/1: Current node rank - –nccl_host: NCCL communication host address - –nccl_port 2732: NCCL communication port
2.2 Multi-node EP Mode#
Suitable for deploying MoE models across multiple nodes.
Node 0 Launch Command:
# H200 Multi-node DeepSeek-R1 EP Mode Node 0
# Usage: sh multi_node_ep_node0.sh <nccl_host>
export nccl_host=$1
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--dp 16 \
--enable_fa3 \
--nnodes 2 \
--node_rank 0 \
--nccl_host $nccl_host \
--nccl_port 2732
Node 1 Launch Command:
# H200 Multi-node DeepSeek-R1 EP Mode Node 1
# Usage: sh multi_node_ep_node1.sh <nccl_host>
export nccl_host=$1
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--dp 16 \
--enable_fa3 \
--nnodes 2 \
--node_rank 1 \
--nccl_host $nccl_host \
--nccl_port 2732
Optional Optimization Parameters: - –enable_prefill_microbatch_overlap: Enable prefill microbatch overlap - –enable_decode_microbatch_overlap: Enable decode microbatch overlap
3. PD disaggregation Deployment Solutions#
PD (Prefill-Decode) disaggregation mode separates prefill and decode stages for deployment, which can better utilize hardware resources.
3.1 Single PD Master Mode#
Step 1: Launch PD Master Service
# PD Master for DeepSeek-R1
# Usage: sh pd_master.sh <pd_master_ip>
export pd_master_ip=$1
python -m lightllm.server.api_server --model_dir /path/DeepSeek-R1 \
--run_mode "pd_master" \
--host $pd_master_ip \
--port 60011
Step 2: Launch Prefill Service
# PD prefill mode for DeepSeek-R1 (DP+EP) on H200
# Usage: sh pd_prefill.sh <host> <pd_master_ip>
# nvidia-cuda-mps-control -d, run MPS (optional, performance will be much better with mps support, but some GPUs may encounter errors when enabling mps, it's recommended to upgrade to a higher driver version, especially for H-series cards)
export host=$1
export pd_master_ip=$2
nvidia-cuda-mps-control -d
MOE_MODE=EP KV_TRANS_USE_P2P=1 LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "prefill" \
--tp 8 \
--dp 8 \
--host $host \
--port 8019 \
--nccl_port 2732 \
--enable_fa3 \
--disable_cudagraph \
--pd_master_ip $pd_master_ip
Step 3: Launch Decode Service
# PD decode mode for DeepSeek-R1 (DP+EP) on H200
# Usage: sh pd_decode.sh <host> <pd_master_ip>
export host=$1
export pd_master_ip=$2
nvidia-cuda-mps-control -d
MOE_MODE=EP KV_TRANS_USE_P2P=1 LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "decode" \
--tp 8 \
--dp 8 \
--host $host \
--port 8121 \
--nccl_port 12322 \
--enable_fa3 \
--disable_cudagraph \
--pd_master_ip $pd_master_ip \
--pd_master_port 60011
# if you want to enable microbatch overlap, you can uncomment the following lines
#--enable_decode_microbatch_overlap
3.2 Multi PD Master Mode#
Supports multiple PD Master nodes, providing better load balancing and high availability.
Step 1: Launch Config Server
# Config Server
# Usage: sh config_server.sh <config_server_host>
export config_server_host=$1
python -m lightllm.server.api_server \
--run_mode "config_server" \
--config_server_host $config_server_host \
--config_server_port 60088
Step 2: Launch Multiple PD Masters
# PD Master 1
# Usage: sh pd_master_1.sh <host> <config_server_host>
export host=$1
export config_server_host=$2
python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "pd_master" \
--host $host \
--port 60011 \
--config_server_host $config_server_host \
--config_server_port 60088
# PD Master 2
# Usage: sh pd_master_2.sh <host> <config_server_host>
export host=$1
export config_server_host=$2
python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "pd_master" \
--host $host \
--port 60012 \
--config_server_host $config_server_host \
--config_server_port 60088
Step 3: Launch Prefill and Decode Services
# Prefill Service
export host=$1
export config_server_host=$2
nvidia-cuda-mps-control -d
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "prefill" \
--host $host \
--port 8019 \
--tp 8 \
--dp 8 \
--nccl_port 2732 \
--enable_fa3 \
--disable_cudagraph \
--config_server_host $config_server_host \
--config_server_port 60088
# if you want to enable microbatch overlap, you can uncomment the following lines
#--enable_prefill_microbatch_overlap
# Decode Service
export host=$1
export config_server_host=$2
nvidia-cuda-mps-control -d
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "decode" \
--host $host \
--port 8121 \
--nccl_port 12322 \
--tp 8 \
--dp 8 \
--enable_fa3 \
--config_server_host $config_server_host \
--config_server_port 60088
# if you want to enable microbatch overlap, you can uncomment the following lines
#--enable_decode_microbatch_overlap
4. Testing and Validation#
4.1 Basic Functionality Testing#
curl http://server_ip:server_port/generate \
-H "Content-Type: application/json" \
-d '{
"inputs": "What is AI?",
"parameters":{
"max_new_tokens":17,
"frequency_penalty":1
}
}'
4.2 Performance Benchmark Testing#
# DeepSeek-R1 Performance Testing
cd test
python benchmark_client.py \
--num_clients 100 \
--input_num 2000 \
--tokenizer_path /path/DeepSeek-R1/ \
--url http://127.0.0.1:8088/generate_stream
All the above scripts can be referenced from the scripts in the test/start_scripts/multi_pd_master/ directory.