Connector | DevOps & Infra | Free
vLLM
High-throughput, memory-efficient LLM inference engine with PagedAttention and continuous batching.
Installs: 820k
Rating: ★ 4.9
Reviews: 273
vLLM is a fast, easy-to-use library for LLM inference and serving. Its serving throughput comes from three pieces working together: PagedAttention, an attention algorithm that stores attention keys and values (the KV cache) in fixed-size blocks rather than one contiguous buffer per request; continuous batching of incoming requests; and optimized CUDA kernels.
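To make the memory-management idea concrete, here is a toy sketch of block-based KV cache bookkeeping in the spirit of PagedAttention. It is not vLLM's implementation: `BlockAllocator`, `Sequence`, and the block size of 16 are hypothetical names and values chosen for illustration, and no attention is actually computed. The point it shows is that a sequence only ever wastes the unused tail of its last block.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class BlockAllocator:
    """Hands out physical KV block ids from a fixed pool (hypothetical helper)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, blocks):
        # Return a finished sequence's blocks to the pool for reuse.
        self.free.extend(blocks)


class Sequence:
    """Tracks one request's block table: logical block index -> physical block id."""

    def __init__(self):
        self.block_table = []
        self.num_tokens = 0

    def append_token(self, allocator):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        # Translate a logical token position into (physical block, offset).
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def wasted_slots(self):
        # Only the unused tail of the last block is wasted.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


allocator = BlockAllocator(num_blocks=8)
seq = Sequence()
for _ in range(37):
    seq.append_token(allocator)

print(len(seq.block_table), seq.wasted_slots())  # 3 blocks, 11 unused slots
```

A contiguous scheme that reserved space for a maximum length up front would hold that whole reservation for the life of the request; with a block table, the remaining five blocks in this pool stay free for other sequences.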
Key Features
- PagedAttention: near-zero KV cache waste, enabling up to 24× higher throughput than HuggingFace Transformers
- Continuous batching: dynamically schedules requests for maximum GPU utilization
- OpenAI-compatible REST API: drop-in replacement for OpenAI endpoints
- Quantization support: GPTQ, AWQ, SqueezeLLM, FP8
- Speculative decoding and chunked prefill
- Supports 100+ models: Llama, Mistral, Qwen, Phi, Gemma, and more
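The continuous batching item above can be illustrated with a small simulation. This is a simplified model, not vLLM's scheduler: it assumes every running request produces exactly one token per step and ignores prefill cost, and both function names are hypothetical. It counts how many decode steps a queue of requests needs when freed slots are refilled immediately, compared with static batches that run until their longest request finishes.

```python
from collections import deque


def continuous_batching_steps(lengths, max_batch):
    """Decode steps needed when finished requests are replaced right away."""
    waiting = deque(lengths)  # remaining tokens to generate per queued request
    running = []
    steps = 0
    while waiting or running:
        # Admit queued requests into any free batch slots before each step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One step generates one token for every running request;
        # requests with nothing left to generate drop out of the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


def static_batching_steps(lengths, max_batch):
    """Decode steps needed when each batch runs until its longest request ends."""
    return sum(
        max(lengths[i:i + max_batch])
        for i in range(0, len(lengths), max_batch)
    )


lengths = [8, 2, 2, 2]  # tokens each request still has to generate
print(static_batching_steps(lengths, max_batch=2))      # 10
print(continuous_batching_steps(lengths, max_batch=2))  # 8
```

In the static case the short request in the first batch finishes early and its slot sits idle for six steps; the continuous version fills that slot with the next queued request, which is where the utilization gain comes from.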
Quick Start
pip install vllm

# Start OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct

# Query it
curl http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
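The same query can be sent from Python. The sketch below uses only the standard library and assumes the server from the step above is running on localhost:8000 with the same model; `build_chat_request` and `chat` are hypothetical helper names, and the response parsing assumes the usual OpenAI chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same address as the curl example
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Hello!"))` prints the model's reply. Because the endpoint follows the OpenAI schema, an existing OpenAI client pointed at this base URL should work as well.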
Install via ai-supply
npx ai-supply add vllm-high-throughput-inference
Curated mirror of the open-source vLLM (Apache-2.0). Get it from the source.