◐ModelLanguage & NLPFree
llama.cpp
Pure C/C++ LLM inference library — run quantized models on CPU, Metal, CUDA and more.
Installs900k
Rating★ 4.9
Reviews300
llama.cpp
llama.cpp is a pure C/C++ port of Meta's LLaMA model inference, designed for maximum portability and performance across a wide variety of hardware — from MacBook laptops to cloud GPUs. It pioneered 4-bit quantization (GGUF format) that makes running large language models on consumer hardware practical.
Key Features
- GGUF format: the community standard for quantized LLM weights (4-bit, 5-bit, 8-bit, etc.)
- Cross-platform: macOS (Metal), Linux, Windows, iOS, Android, WebAssembly
- Multi-backend: CPU, CUDA, ROCm, Vulkan, OpenCL, SYCL
- OpenAI-compatible server built-in (
llama-server) - Python bindings via
llama-cpp-python - Supports Llama, Mistral, Phi, Gemma, Qwen, Falcon, Starcoder, and dozens more
Quick Start
# Build
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release -j
# Run inference
./build/bin/llama-cli -m model.gguf -p "Tell me about AI:"
# Or use the Python wrapper
pip install llama-cpp-python
Install via ai-supply
npx ai-supply add llama-cpp-cpu-inference
Curated mirror of the open-source llama.cpp (MIT). Get it from the source.