Name: vLLM
Author: ai-supply

vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. It achieves state-of-the-art serving throughput through PagedAttention, an attention algorithm that efficiently manages attention key and value (KV) memory, combined with continuous batching of incoming requests and optimized CUDA kernels.
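
To make the memory argument concrete, here is a toy calculation (not vLLM's actual allocator) comparing two ways of reserving KV cache slots. It assumes a hypothetical block size of 16 tokens, a hypothetical maximum length of 2048, and made-up sequence lengths; the function names are illustrative only.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block (illustrative assumption)


def contiguous_waste(seq_lens, max_len):
    """Unused slots when each request reserves max_len token slots up front."""
    return sum(max_len - n for n in seq_lens)


def paged_waste(seq_lens, block_size=BLOCK_SIZE):
    """Unused slots when memory is handed out in fixed-size blocks on demand.

    Only the last, partially filled block of each sequence is wasted.
    """
    return sum(math.ceil(n / block_size) * block_size - n for n in seq_lens)


# Made-up sequence lengths (prompt plus generated tokens) for four requests.
seq_lens = [37, 512, 130, 9]
print("contiguous waste:", contiguous_waste(seq_lens, max_len=2048))
print("paged waste:     ", paged_waste(seq_lens))
```

With block-wise allocation, the waste per sequence is bounded by one block, regardless of how far the sequence is from the maximum length.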

Key Features

PagedAttention: near-zero KV cache waste, enabling up to 24× higher throughput than HuggingFace Transformers
Continuous batching: dynamically schedules requests for maximum GPU utilization
OpenAI-compatible REST API: drop-in replacement for OpenAI endpoints
Quantization support: GPTQ, AWQ, SqueezeLLM, FP8
Speculative decoding and chunked prefill
Supports 100+ models: Llama, Mistral, Qwen, Phi, Gemma, and more
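
The continuous batching idea from the list above can be sketched with a toy simulation. This is a simplified model under stated assumptions, not vLLM's scheduler: each request needs a fixed number of decode steps, the batch holds at most `capacity` requests, and one step generates one token per running request. All names here are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, capacity):
    """Steps needed when each batch must fully finish before the next starts."""
    steps = 0
    for i in range(0, len(lengths), capacity):
        # A static batch runs as long as its longest request.
        steps += max(lengths[i:i + capacity])
    return steps


def continuous_batching_steps(lengths, capacity):
    """Steps needed when finished requests are replaced at every step."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill any free slots from the waiting queue before the next step.
        while waiting and len(running) < capacity:
            running.append(waiting.popleft())
        steps += 1
        # Each running request emits one token; drop those that are done.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Made-up workload: tokens to generate per request, two slots on the GPU.
lengths = [8, 2, 2, 2, 8, 2]
print("static:    ", static_batching_steps(lengths, capacity=2))
print("continuous:", continuous_batching_steps(lengths, capacity=2))
```

In this toy model, short requests no longer wait behind a long one in the same batch, so slots are refilled as soon as they free up.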

Quick Start

pip install vllm

# Start OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct

# Query it
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
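
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server from the step above is listening on localhost:8000 and returns the usual OpenAI-style chat completion response; the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the server started above
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL):
    """Build a POST request equivalent to the curl command above."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        reply = json.load(resp)
    # OpenAI-style responses put the text under choices[0].message.content.
    return reply["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
#     print(chat("Hello!"))
```

Because the endpoint follows the OpenAI request format, an existing OpenAI client pointed at this base URL should also work.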

Install via ai-supply

npx ai-supply add vllm-high-throughput-inference

Curated mirror of the open-source vLLM (Apache-2.0). Get it from the source.
