Welcome to Lightllm!
====================

.. figure:: ./assets/logos/lightllm-logo.png
  :width: 100%
  :align: center
  :alt: Lightllm
  :class: no-scaled-link

.. raw:: html

   A Lightweight and High-Performance Large Language Model Service Framework

Lightllm is a pure Python-based large language model inference and serving framework, featuring lightweight design, easy extensibility, and high performance. Lightllm integrates the advantages of numerous open-source solutions, including but not limited to FasterTransformer, TGI, vLLM, SGLang, and FlashAttention.

**Key Features**:

* Multi-process Collaboration: Input text encoding, language model inference, visual model inference, and output decoding are performed asynchronously, significantly improving GPU utilization.
* Cross-process Request Object Sharing: Through shared memory, cross-process request object sharing is achieved, reducing inter-process communication latency.
* Efficient Scheduling Strategy: Peak memory scheduling strategy with prediction, maximizing GPU memory utilization while reducing request eviction.
* High-performance Inference Backend: Efficient operator implementation, support for multiple parallelization methods (tensor parallelism, data parallelism, and expert parallelism), dynamic KV cache, rich quantization support (int8, fp8, int4), structured output, and multi-result prediction.

Documentation List
------------------

.. toctree::
   :maxdepth: 1
   :caption: Quick Start

   Installation Guide
   Quick Start
   Performance Benchmark

.. toctree::
   :maxdepth: 1
   :caption: Deployment Tutorials

   DeepSeek R1 Deployment
   Multimodal Deployment
   Reward Model Deployment
   OpenAI api Usage
   APIServer Parameters
   Lightllm API Introduction

.. toctree::
   :maxdepth: 1
   :caption: Model Support

   Supported Models List
   Adding New Models

.. toctree::
   :maxdepth: 1
   :caption: Architecture Introduction

   Architecture Overview
   Token Attention
   Efficient Router

.. Indices and tables
.. ==================
.. * :ref:`genindex`
.. * :ref:`modindex`