Name: llama.cpp
Availability: InStock
Rating: 4.9 (300 reviews)
Author: ai-supply

llama.cpp

llama.cpp is a pure C/C++ port of Meta's LLaMA model inference, designed for maximum portability and performance across a wide variety of hardware — from MacBook laptops to cloud GPUs. It pioneered 4-bit quantization (GGUF format) that makes running large language models on consumer hardware practical.

Key Features

GGUF format: the community standard for quantized LLM weights (4-bit, 5-bit, 8-bit, etc.)
Cross-platform: macOS (Metal), Linux, Windows, iOS, Android, WebAssembly
Multi-backend: CPU, CUDA, ROCm, Vulkan, OpenCL, SYCL
OpenAI-compatible server built-in (llama-server)
Python bindings via llama-cpp-python
Supports Llama, Mistral, Phi, Gemma, Qwen, Falcon, Starcoder, and dozens more

Quick Start

# Build
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Run inference
./build/bin/llama-cli -m model.gguf -p "Tell me about AI:"

# Or use the Python wrapper
pip install llama-cpp-python

Install via ai-supply

npx ai-supply add llama-cpp-cpu-inference

Curated mirror of the open-source llama.cpp (MIT). Get it from the source.

llama.cpp

llama.cpp

Key Features

Quick Start

Install via ai-supply

More from @ai-supply