Reasoning Parser#

LightLLM supports parsing reasoning model’s thinking process, separating internal reasoning from final answers to improve AI system transparency.

Supported Models#

DeepSeek-R1#

Parser: deepseek-r1

Format:

<think>
Reasoning process...
</think>
Final answer

Features: Forced reasoning mode, some variants may omit <think> opening tag

Startup:

python -m lightllm.server.api_server \
    --model_dir /path/to/DeepSeek-R1 \
    --reasoning_parser deepseek-r1 \
    --tp 8

DeepSeek-V3#

Parser: deepseek-v3

Format: Same as Qwen3

Startup:

python -m lightllm.server.api_server \
    --model_dir /path/to/DeepSeek-V3 \
    --reasoning_parser deepseek-v3 \
    --tp 8

Request Config:

data = {
    "chat_template_kwargs": {"thinking": True}  # Enable reasoning
}

Qwen3#

Parser: qwen3

Format: <think>Reasoning content</think>Answer

Features: Optional reasoning mode, supports dynamic switching

# Enable reasoning
data = {"chat_template_kwargs": {"enable_thinking": True}}

GPT-OSS#

Parser: gpt-oss

Format:

<|start|><|channel|>analysis<|message|>
Reasoning analysis...
<|end|>
<|channel|>final<|message|>
Final answer
<|return|>

Features: Complex state machine parsing, supports multiple channels (analysis, commentary, final)

Basic Usage#

Non-Streaming#

import requests
import json

url = "http://localhost:8088/v1/chat/completions"
data = {
    "model": "deepseek-r1",
    "messages": [
        {"role": "user", "content": "How many 'r's in 'strawberry'?"}
    ],
    "max_tokens": 2000,
    "separate_reasoning": True,  # Separate reasoning content
    "chat_template_kwargs": {"enable_thinking": True}
}

response = requests.post(url, json=data).json()
message = response["choices"][0]["message"]

print("Reasoning:", message.get("reasoning_content"))
print("Answer:", message.get("content"))

Streaming#

data = {
    "model": "deepseek-r1",
    "messages": [{"role": "user", "content": "Explain quantum entanglement"}],
    "stream": True,
    "separate_reasoning": True,
    "stream_reasoning": True  # Stream reasoning content in real-time
}

response = requests.post(url, json=data, stream=True)

for line in response.iter_lines():
    if line and line.startswith(b"data: "):
        data_str = line[6:].decode('utf-8')
        if data_str == '[DONE]':
            break

        chunk = json.loads(data_str)
        delta = chunk["choices"][0]["delta"]

        # Reasoning content
        if "reasoning_content" in delta:
            print(delta["reasoning_content"], end="", flush=True)

        # Answer content
        if "content" in delta:
            print(delta["content"], end="", flush=True)

Response Format#

Non-Streaming:

{
    "choices": [{
        "message": {
            "content": "Final answer",
            "reasoning_content": "Reasoning process"
        }
    }]
}

Streaming:

// Reasoning chunk
{"choices": [{"delta": {"reasoning_content": "Reasoning fragment"}}]}

// Answer chunk
{"choices": [{"delta": {"content": "Answer fragment"}}]}

Advanced Features#

Dynamic Reasoning Mode#

# Enable reasoning
data = {
    "chat_template_kwargs": {"enable_thinking": True},
    "separate_reasoning": True
}

# Disable reasoning
data = {
    "chat_template_kwargs": {"enable_thinking": False}
}

Control Reasoning Display#

# Hide reasoning streaming
data = {
    "separate_reasoning": True,
    "stream_reasoning": False  # reasoning_content field still exists
}

# Merge reasoning and answer
data = {
    "separate_reasoning": False  # Merged in content field
}

Integration with Tool Calling#

data = {
    "model": "deepseek-r1",
    "tools": tools,
    "tool_choice": "auto",
    "separate_reasoning": True,
    "chat_template_kwargs": {"enable_thinking": True}
}

response = requests.post(url, json=data).json()
message = response["choices"][0]["message"]

# Get reasoning, tool calls, and answer simultaneously
print("Reasoning:", message.get("reasoning_content"))
print("Tools:", message.get("tool_calls"))
print("Answer:", message.get("content"))

Multi-Turn Reasoning#

messages = [{"role": "user", "content": "What is a prime number?"}]

# First turn
response1 = requests.post(url, json={
    "messages": messages,
    "separate_reasoning": True
}).json()

message1 = response1["choices"][0]["message"]
messages.append({
    "role": "assistant",
    "content": message1["content"],
    "reasoning_content": message1.get("reasoning_content")
})

# Second turn
messages.append({"role": "user", "content": "Is 17 a prime number?"})
response2 = requests.post(url, json={
    "messages": messages,
    "separate_reasoning": True
}).json()

Configuration#

separate_reasoning (bool, default: True)

Whether to separate reasoning content into reasoning_content field

stream_reasoning (bool, default: False)

Whether to stream reasoning content in real-time

chat_template_kwargs (object)
  • enable_thinking: Enable reasoning (Qwen3, GLM45)

  • thinking: Enable reasoning (DeepSeek-V3)

–reasoning_parser (startup parameter)

Specify parser type: deepseek-r1, qwen3, glm45, gpt-oss, etc.

Common Issues#

Reasoning content not separated

Check --reasoning_parser, separate_reasoning: true, chat_template_kwargs

Model not generating reasoning

Confirm model supports reasoning mode, check if reasoning parameters are enabled

Incomplete streaming

Process all chunks, wait for [DONE] signal

Conflict with tool calling

Use latest version (includes PR #1158), configure parameters correctly

Performance#

Token Consumption: Reasoning mode may increase token usage by 3-5x

Latency Impact: TTFB may increase from 200ms to 800ms

Optimization: - Use stream_reasoning: true to reduce perceived latency - Disable reasoning mode for non-critical tasks

Technical Details#

Core Files: - lightllm/server/reasoning_parser.py - Parser implementation - lightllm/server/api_openai.py - API integration - test/test_api/test_openai_api.py - Test examples

Related PRs: - PR #1154: Add reasoning parser - PR #1158: Function call in reasoning content support

References#