Reasoning Parser#
LightLLM supports parsing reasoning model’s thinking process, separating internal reasoning from final answers to improve AI system transparency.
Supported Models#
DeepSeek-R1#
Parser: deepseek-r1
Format:
<think>
Reasoning process...
</think>
Final answer
Features: Forced reasoning mode, some variants may omit <think> opening tag
Startup:
python -m lightllm.server.api_server \
--model_dir /path/to/DeepSeek-R1 \
--reasoning_parser deepseek-r1 \
--tp 8
DeepSeek-V3#
Parser: deepseek-v3
Format: Same as Qwen3
Startup:
python -m lightllm.server.api_server \
--model_dir /path/to/DeepSeek-V3 \
--reasoning_parser deepseek-v3 \
--tp 8
Request Config:
data = {
"chat_template_kwargs": {"thinking": True} # Enable reasoning
}
Qwen3#
Parser: qwen3
Format: <think>Reasoning content</think>Answer
Features: Optional reasoning mode, supports dynamic switching
# Enable reasoning
data = {"chat_template_kwargs": {"enable_thinking": True}}
GPT-OSS#
Parser: gpt-oss
Format:
<|start|><|channel|>analysis<|message|>
Reasoning analysis...
<|end|>
<|channel|>final<|message|>
Final answer
<|return|>
Features: Complex state machine parsing, supports multiple channels (analysis, commentary, final)
Basic Usage#
Non-Streaming#
import requests
import json
url = "http://localhost:8088/v1/chat/completions"
data = {
"model": "deepseek-r1",
"messages": [
{"role": "user", "content": "How many 'r's in 'strawberry'?"}
],
"max_tokens": 2000,
"separate_reasoning": True, # Separate reasoning content
"chat_template_kwargs": {"enable_thinking": True}
}
response = requests.post(url, json=data).json()
message = response["choices"][0]["message"]
print("Reasoning:", message.get("reasoning_content"))
print("Answer:", message.get("content"))
Streaming#
data = {
"model": "deepseek-r1",
"messages": [{"role": "user", "content": "Explain quantum entanglement"}],
"stream": True,
"separate_reasoning": True,
"stream_reasoning": True # Stream reasoning content in real-time
}
response = requests.post(url, json=data, stream=True)
for line in response.iter_lines():
if line and line.startswith(b"data: "):
data_str = line[6:].decode('utf-8')
if data_str == '[DONE]':
break
chunk = json.loads(data_str)
delta = chunk["choices"][0]["delta"]
# Reasoning content
if "reasoning_content" in delta:
print(delta["reasoning_content"], end="", flush=True)
# Answer content
if "content" in delta:
print(delta["content"], end="", flush=True)
Response Format#
Non-Streaming:
{
"choices": [{
"message": {
"content": "Final answer",
"reasoning_content": "Reasoning process"
}
}]
}
Streaming:
// Reasoning chunk
{"choices": [{"delta": {"reasoning_content": "Reasoning fragment"}}]}
// Answer chunk
{"choices": [{"delta": {"content": "Answer fragment"}}]}
Advanced Features#
Dynamic Reasoning Mode#
# Enable reasoning
data = {
"chat_template_kwargs": {"enable_thinking": True},
"separate_reasoning": True
}
# Disable reasoning
data = {
"chat_template_kwargs": {"enable_thinking": False}
}
Control Reasoning Display#
# Hide reasoning streaming
data = {
"separate_reasoning": True,
"stream_reasoning": False # reasoning_content field still exists
}
# Merge reasoning and answer
data = {
"separate_reasoning": False # Merged in content field
}
Integration with Tool Calling#
data = {
"model": "deepseek-r1",
"tools": tools,
"tool_choice": "auto",
"separate_reasoning": True,
"chat_template_kwargs": {"enable_thinking": True}
}
response = requests.post(url, json=data).json()
message = response["choices"][0]["message"]
# Get reasoning, tool calls, and answer simultaneously
print("Reasoning:", message.get("reasoning_content"))
print("Tools:", message.get("tool_calls"))
print("Answer:", message.get("content"))
Multi-Turn Reasoning#
messages = [{"role": "user", "content": "What is a prime number?"}]
# First turn
response1 = requests.post(url, json={
"messages": messages,
"separate_reasoning": True
}).json()
message1 = response1["choices"][0]["message"]
messages.append({
"role": "assistant",
"content": message1["content"],
"reasoning_content": message1.get("reasoning_content")
})
# Second turn
messages.append({"role": "user", "content": "Is 17 a prime number?"})
response2 = requests.post(url, json={
"messages": messages,
"separate_reasoning": True
}).json()
Configuration#
- separate_reasoning (bool, default: True)
Whether to separate reasoning content into
reasoning_contentfield- stream_reasoning (bool, default: False)
Whether to stream reasoning content in real-time
- chat_template_kwargs (object)
enable_thinking: Enable reasoning (Qwen3, GLM45)thinking: Enable reasoning (DeepSeek-V3)
- –reasoning_parser (startup parameter)
Specify parser type:
deepseek-r1,qwen3,glm45,gpt-oss, etc.
Common Issues#
- Reasoning content not separated
Check
--reasoning_parser,separate_reasoning: true,chat_template_kwargs- Model not generating reasoning
Confirm model supports reasoning mode, check if reasoning parameters are enabled
- Incomplete streaming
Process all chunks, wait for
[DONE]signal- Conflict with tool calling
Use latest version (includes PR #1158), configure parameters correctly
Performance#
Token Consumption: Reasoning mode may increase token usage by 3-5x
Latency Impact: TTFB may increase from 200ms to 800ms
Optimization:
- Use stream_reasoning: true to reduce perceived latency
- Disable reasoning mode for non-critical tasks
Technical Details#
Core Files:
- lightllm/server/reasoning_parser.py - Parser implementation
- lightllm/server/api_openai.py - API integration
- test/test_api/test_openai_api.py - Test examples
Related PRs: - PR #1154: Add reasoning parser - PR #1158: Function call in reasoning content support
References#
DeepSeek-R1 Technical Report
LightLLM GitHub: ModelTC/lightllm