Multi-Level Cache Deployment Guide#
LightLLM supports a multi-level KV Cache mechanism. By combining three cache levels, GPU (L1), CPU (L2), and disk (L3), it can significantly reduce deployment costs and improve throughput in long-text scenarios. This document explains how to configure and use the multi-level cache.
Prerequisites#
Using L3 cache requires the LightMem library. LightMem is a high-performance KV Cache disk management library specifically designed for large language model inference systems.
Note
If you only use two-level caching (L1 + L2), i.e., GPU + CPU cache, LightMem is not required.
LightMem support is only needed when the --enable_disk_cache parameter is enabled.
Installing LightMem#
Source Code Location:
Installation:
For detailed installation instructions, refer to the LightMem repository’s README documentation.
Multi-Level Cache Architecture#
LightLLM’s multi-level cache system adopts a hierarchical design:
L1 Cache (GPU Memory): The fastest cache layer, storing hot request KV Cache with the lowest latency
L2 Cache (CPU Memory): Medium-speed cache layer, storing relatively cold KV Cache at lower cost than GPU
L3 Cache (Disk Storage): Maximum capacity cache layer, storing long-term inactive KV Cache at the lowest cost
Working Principle:
The current mechanism keeps an exact backup copy of the GPU cache data in the CPU cache, rather than storing only the content that does not fit in the GPU cache
L1, L2, and L3 caches all use LRU eviction strategy for data management
To avoid frequent disk writes in L3 cache, you can use the LIGHTLLM_DISK_CACHE_PROMPT_LIMIT_LENGTH environment variable to control the minimum length threshold for writes. If set to 0, all L2 data will be written to L3 cache
During queries, L1 is checked first to find the longest matching prefix, then L2 is queried to continue extending the longest matching prefix, and finally L3 is queried for the remaining part
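The lookup order and the LRU behaviour described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not LightLLM's implementation: the `LruTier` class, the page-granular keys (each key is the full token prefix up to a page boundary), and the `tiered_prefix_lookup` function are hypothetical names introduced only for illustration.

```python
from collections import OrderedDict
from typing import Dict, Sequence, Tuple

# Hypothetical key format: the whole token prefix up to a page boundary.
PageKey = Tuple[int, ...]


class LruTier:
    """Toy cache tier that holds a bounded number of pages with LRU eviction."""

    def __init__(self, capacity_pages: int) -> None:
        self.capacity_pages = capacity_pages
        self._pages: "OrderedDict[PageKey, None]" = OrderedDict()

    def put(self, key: PageKey) -> None:
        self._pages[key] = None
        self._pages.move_to_end(key)  # the newest entry is the most recently used
        while len(self._pages) > self.capacity_pages:
            self._pages.popitem(last=False)  # evict the least recently used page

    def hit(self, key: PageKey) -> bool:
        if key in self._pages:
            self._pages.move_to_end(key)  # a hit refreshes recency
            return True
        return False


def extend_match(tokens: Sequence[int], start: int, tier: LruTier, page_size: int) -> int:
    """Extend the matched prefix page by page, starting at token index `start`."""
    pos = start
    while pos + page_size <= len(tokens) and tier.hit(tuple(tokens[: pos + page_size])):
        pos += page_size
    return pos


def tiered_prefix_lookup(
    tokens: Sequence[int], l1: LruTier, l2: LruTier, l3: LruTier, page_size: int
) -> Dict[str, int]:
    """Check L1 first, let L2 extend the match, then let L3 cover what remains."""
    after_l1 = extend_match(tokens, 0, l1, page_size)
    after_l2 = extend_match(tokens, after_l1, l2, page_size)
    after_l3 = extend_match(tokens, after_l2, l3, page_size)
    return {
        "l1": after_l1,
        "l2": after_l2 - after_l1,
        "l3": after_l3 - after_l2,
        "miss": len(tokens) - after_l3,  # tokens that must be recomputed
    }
```

In this model, each tier only extends the prefix that the faster tiers have already matched, so a request pays the slower tiers' latency only for the part of the prompt they alone hold.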
Applicable Scenarios:
Ultra-long text processing (e.g., million-token level context)
High-concurrency conversation scenarios (requiring caching of large amounts of conversation history)
Cost-sensitive deployments (replacing expensive GPU memory with cheaper RAM and disk)
Prompt Cache scenarios (reusing common prompt prefixes)
Deployment Solutions#
1. L1 + L2 Two-Level Cache (GPU + CPU)#
Suitable for most scenarios, significantly increasing cache capacity while maintaining high performance.
Startup Command:
# Enable GPU + CPU two-level cache
LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/to/Qwen3-235B-A22B \
--tp 8 \
--graph_max_batch_size 500 \
--mem_fraction 0.88 \
--enable_cpu_cache \
--cpu_cache_storage_size 400 \
--cpu_cache_token_page_size 64
Parameter Description:
Basic Parameters#
LOADWORKER=18: Number of model loading threads to speed up model loading. Recommended to set to half the number of CPU cores
--model_dir: Model file path, supports local path or HuggingFace model name
--tp 8: Tensor parallelism degree, using 8 GPUs for model inference
--graph_max_batch_size 500: CUDA Graph maximum batch size, affects throughput and memory usage
--mem_fraction 0.88: GPU memory usage ratio, recommended to set to 0.88 or below
CPU Cache Parameters#
--enable_cpu_cache: Enable CPU cache (L2 layer), the core parameter for enabling two-level cache
--cpu_cache_storage_size 400: CPU cache capacity in GB, set to 400GB here
Capacity planning: Every 2GB can cache approximately 10K tokens of KV Cache (depending on model configuration)
Recommended to set to 50~60% of available system memory
For machines with 2TB memory, recommended to set to 1~1.2TB
--cpu_cache_token_page_size 64: CPU cache page size in number of tokens
Default value is 256, recommended range 64-512
Smaller page sizes (e.g., 64) are suitable for fine-grained cache management, reducing memory fragmentation and improving hit rates
Larger page sizes (e.g., 256) are suitable for bulk data migration, improving transfer efficiency
This value needs to balance memory utilization and transfer overhead
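The capacity and page-size guidance above reduces to simple arithmetic. The sketch below applies this guide's rule of thumb (2GB per roughly 10K tokens, which varies by model) and its 50~60% memory recommendation. All function names are hypothetical, and `full_pages` assumes that only completely filled pages are cached, which this guide does not state.

```python
from typing import Tuple

# Rule of thumb from this guide; the real figure depends on the model configuration.
GB_PER_10K_TOKENS = 2.0


def estimate_cached_tokens(cache_size_gb: float) -> int:
    """Approximate number of tokens whose KV Cache fits in `cache_size_gb`."""
    return int(cache_size_gb / GB_PER_10K_TOKENS * 10_000)


def recommended_cpu_cache_gb(system_memory_gb: float) -> Tuple[float, float]:
    """Lower and upper bound of the 50~60% recommendation, in GB."""
    return 0.5 * system_memory_gb, 0.6 * system_memory_gb


def full_pages(prompt_tokens: int, page_size: int) -> Tuple[int, int]:
    """Return (tokens covered by whole pages, leftover tokens).

    Assumption: only completely filled pages are cached.
    """
    pages = prompt_tokens // page_size
    covered = pages * page_size
    return covered, prompt_tokens - covered


if __name__ == "__main__":
    print(estimate_cached_tokens(400))      # value used for --cpu_cache_storage_size above
    print(recommended_cpu_cache_gb(2000))   # a machine with about 2TB of memory
    print(full_pages(1000, 64), full_pages(1000, 256))
```

Under the whole-page assumption, a 1000-token prompt leaves 40 tokens uncovered with a page size of 64 but 232 tokens uncovered with a page size of 256, which is one way to see why smaller pages can improve hit rates.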
Performance Optimization Suggestions:
Using Hugepages: Run the following commands and set the LIGHTLLM_HUGE_PAGE_ENABLE environment variable to enable huge page mode. Huge page memory can significantly shorten service startup, so enable it if the service takes too long to start. Note that memory reserved for huge pages stays occupied for as long as the reservation is in place
sudo sed -i 's/^GRUB_CMDLINE_LINUX=\"/& default_hugepagesz=1G hugepagesz=1G hugepages={required_huge_page_capacity}/' /etc/default/grub
sudo update-grub
sudo reboot
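After the reboot, it can help to confirm that the kernel actually reserved the huge pages. The sketch below parses the standard `HugePages_Total`, `HugePages_Free`, and `Hugepagesize` fields of `/proc/meminfo`. The function names are hypothetical, and this guide does not state how many pages LightLLM needs, so the check only reports what is reserved.

```python
import re
from typing import Dict


def parse_hugepages(meminfo_text: str) -> Dict[str, int]:
    """Extract the huge page counters from the text of /proc/meminfo."""
    fields: Dict[str, int] = {}
    for name in ("HugePages_Total", "HugePages_Free", "Hugepagesize"):
        match = re.search(rf"^{name}:\s+(\d+)", meminfo_text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"{name} not found; huge pages may be unsupported")
        fields[name] = int(match.group(1))
    return fields


def reserved_hugepage_gb(fields: Dict[str, int]) -> float:
    """Total reserved huge page memory in GB (Hugepagesize is reported in kB)."""
    return fields["HugePages_Total"] * fields["Hugepagesize"] / (1024 * 1024)
```

On a Linux host, passing the contents of `/proc/meminfo` to `parse_hugepages` and then calling `reserved_hugepage_gb` gives a figure to compare against the CPU cache capacity you plan to configure.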
2. L1 + L2 + L3 Three-Level Cache (GPU + CPU + Disk)#
Suitable for ultra-long text or extremely high-concurrency scenarios, providing maximum cache capacity.
Important
Using three-level cache requires installing the LightMem library first. Please refer to the “Prerequisites” section above to complete the installation.
Startup Command:
# Enable GPU + CPU + Disk three-level cache
LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/to/Qwen3-235B-A22B \
--tp 8 \
--graph_max_batch_size 500 \
--mem_fraction 0.88 \
--enable_cpu_cache \
--cpu_cache_storage_size 400 \
--cpu_cache_token_page_size 256 \
--enable_disk_cache \
--disk_cache_storage_size 1000 \
--disk_cache_dir /mnt/ssd/disk_cache_dir
Parameter Description:
Disk Cache Parameters#
In addition to the two-level cache, add the following parameters:
--enable_disk_cache: Enable disk cache (L3 layer), the core parameter for enabling three-level cache
--disk_cache_storage_size 1000: Disk cache capacity in GB, set to 1TB here
Capacity planning: Every 2GB can cache approximately 10K tokens of KV Cache
Recommended to set based on storage space and business needs, typically ranging from hundreds of GB to several TB
1TB capacity can cache approximately 5M tokens of KV Cache
--disk_cache_dir /mnt/ssd/disk_cache_dir: Disk cache directory, specifying the directory for persisting cache data
If not set, the system temporary directory will be used
Strongly recommended to use SSD/NVMe storage, avoid using HDD (performance difference can be 10-100x)
Ensure the directory has sufficient read/write permissions and disk space
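The permission and free-space requirements above can be checked before starting the server. This is a minimal pre-flight sketch using only the Python standard library. `check_disk_cache_dir` is a hypothetical helper, not part of LightLLM or LightMem, and it does not detect whether the device is an SSD or an HDD.

```python
import os
import shutil
from typing import List


def check_disk_cache_dir(path: str, required_gb: float) -> List[str]:
    """Return a list of problems; an empty list means the directory looks usable."""
    if not os.path.isdir(path):
        return [f"{path} is not an existing directory"]

    problems: List[str] = []
    # Read, write, and traverse permissions for the current user.
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        problems.append(f"{path} is not readable and writable by the current user")

    # Free space on the filesystem that holds the directory, in GB.
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < required_gb:
        problems.append(
            f"{path} has {free_gb:.1f} GB free, less than the requested {required_gb} GB"
        )
    return problems


if __name__ == "__main__":
    # Values mirror the startup command above; adjust to your deployment.
    for problem in check_disk_cache_dir("/mnt/ssd/disk_cache_dir", 1000):
        print("WARNING:", problem)
```

Running the check with the same values passed to --disk_cache_dir and --disk_cache_storage_size surfaces a missing directory or an undersized volume before the server starts.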