Welcome to Lightllm!

A Lightweight and High-Performance Large Language Model Service Framework

Lightllm is a pure Python-based large language model inference and serving framework, featuring lightweight design, easy extensibility, and high performance. Lightllm integrates the advantages of numerous open-source solutions, including but not limited to FasterTransformer, TGI, vLLM, SGLang, and FlashAttention.

Key Features:

  • Multi-process Collaboration: Input text encoding, language model inference, visual model inference, and output decoding are performed asynchronously, significantly improving GPU utilization.

  • Cross-process Request Object Sharing: Request objects are shared across processes through shared memory, reducing inter-process communication latency.

  • Efficient Scheduling Strategy: A scheduling strategy that predicts peak memory usage, maximizing GPU memory utilization while reducing request eviction.

  • High-performance Inference Backend: Efficient operator implementation, support for multiple parallelization methods (tensor parallelism, data parallelism, and expert parallelism), dynamic KV cache, rich quantization support (int8, fp8, int4), structured output, and multi-result prediction.
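
To make the cross-process request sharing idea above concrete, here is a minimal sketch using only Python's standard `multiprocessing.shared_memory` module. It is not Lightllm's actual implementation: the record layout (`REQ_FORMAT`) and the helper names `write_request`, `read_request`, `worker`, and `demo` are hypothetical and chosen purely for illustration. The point it shows is that a second process can attach to a request record by name and update it in place, without serializing the object through a pipe or socket.

```python
import struct
from multiprocessing import Process, shared_memory

# Hypothetical fixed layout for a request record (an assumption for this sketch):
# request id, prompt length, tokens generated so far, as little-endian int64s.
REQ_FORMAT = "<qqq"
REQ_SIZE = struct.calcsize(REQ_FORMAT)


def write_request(buf, req_id, prompt_len, generated):
    """Pack a request record in place at the start of a shared buffer."""
    struct.pack_into(REQ_FORMAT, buf, 0, req_id, prompt_len, generated)


def read_request(buf):
    """Unpack the request record from the start of a shared buffer."""
    return struct.unpack_from(REQ_FORMAT, buf, 0)


def worker(shm_name):
    """Stand-in for an inference process: attach by name, bump the token count."""
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        req_id, prompt_len, generated = read_request(shm.buf)
        write_request(shm.buf, req_id, prompt_len, generated + 1)
    finally:
        shm.close()  # detach only; the creator is responsible for unlink()


def demo():
    """Create a record, let a child process update it, and read it back."""
    shm = shared_memory.SharedMemory(create=True, size=REQ_SIZE)
    try:
        write_request(shm.buf, 7, 128, 0)
        p = Process(target=worker, args=(shm.name,))
        p.start()
        p.join()
        return read_request(shm.buf)
    finally:
        shm.close()
        shm.unlink()


if __name__ == "__main__":
    print(demo())
```

Only the segment name crosses the process boundary here; the record itself stays in one place and the parent reads back the counter that the worker incremented. A real serving framework would additionally need synchronization and a pool of records, which this sketch leaves out.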

Documentation List