Welcome to Lightllm!

A Lightweight and High-Performance Large Language Model Service Framework

Lightllm is a pure Python-based large language model inference and serving framework, featuring lightweight design, easy extensibility, and high performance. Lightllm integrates the advantages of numerous open-source solutions, including but not limited to FasterTransformer, TGI, vLLM, SGLang, and FlashAttention.

Key Features:

Multi-process Collaboration: Input text encoding, language model inference, visual model inference, and output decoding are performed asynchronously, significantly improving GPU utilization.
Cross-process Request Object Sharing: Request objects are shared across processes through shared memory, reducing inter-process communication latency.
Efficient Scheduling Strategy: A peak-memory scheduling strategy with prediction maximizes GPU memory utilization while reducing request eviction.
High-performance Inference Backend: Efficient operator implementations, multiple parallelization methods (tensor parallelism, data parallelism, and expert parallelism), dynamic KV cache, rich quantization support (int8, fp8, int4), structured output, and multi-result prediction.
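
To make the cross-process request object sharing idea more concrete, here is a minimal sketch using Python's standard `multiprocessing.shared_memory` module. It is not Lightllm's actual implementation: the record layout (`REQ_FORMAT`) and the helper names `write_request`, `read_request`, and `demo` are hypothetical and chosen only for illustration. For simplicity, both handles live in one process; in a real deployment the second handle would be opened by name from a different process.

```python
import struct
from multiprocessing import shared_memory

# Hypothetical fixed-size record: request id (int64), prompt length (int32),
# number of generated tokens (int32). Little-endian, no padding.
REQ_FORMAT = "<qii"
REQ_SIZE = struct.calcsize(REQ_FORMAT)


def write_request(buf, request_id, prompt_len, generated):
    """Serialize the request fields in place at the start of the buffer."""
    struct.pack_into(REQ_FORMAT, buf, 0, request_id, prompt_len, generated)


def read_request(buf):
    """Read the request fields back as a (request_id, prompt_len, generated) tuple."""
    return struct.unpack_from(REQ_FORMAT, buf, 0)


def demo():
    # The "owner" side creates the shared block and writes the request state.
    owner = shared_memory.SharedMemory(create=True, size=REQ_SIZE)
    try:
        write_request(owner.buf, 42, 128, 0)

        # The "peer" side attaches to the same block by name. In practice this
        # would happen in another process that was given owner.name.
        peer = shared_memory.SharedMemory(name=owner.name)
        try:
            first = read_request(peer.buf)
            # The owner updates the record in place; the peer observes the new
            # state without any message being sent between the two sides.
            write_request(owner.buf, 42, 128, 7)
            second = read_request(peer.buf)
        finally:
            peer.close()
    finally:
        owner.close()
        owner.unlink()  # release the block once no one needs it
    return first, second


if __name__ == "__main__":
    print(demo())
```

The point of this design is that only the block name has to be passed between processes once; later state changes are plain memory writes rather than serialized messages. A production system would also need synchronization between readers and writers, which this sketch omits.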

Documentation List

Quick Start

Deployment Tutorials

Model Support

Architecture Introduction
