I built a production RAG pipeline for $0 — llama-index + chroma + all-MiniLM

Six weeks ago my team needed semantic search over our internal wiki — roughly 40,000 markdown documents. My first instinct was to reach for the usual paid stack (OpenAI embeddings + Pinecone). Then a colleague pointed me at ai-supply.store and said "check the catalog first."

Three free listings later, we're in production.

The stack

llama-index-data-framework — data ingestion, chunking, query engine
chroma-vector-database — local-first vector store (we run it as a Docker sidecar)
all-minilm-l6-v2-embeddings — 384-dim sentence embeddings, runs on CPU without breaking a sweat

All three are free to install from the catalog, and all three passed the platform's security scan at grade A. That was actually a decision factor — the Most secure leaderboard made it easy to confirm we weren't pulling anything with hidden egress or dependency CVEs.

The setup

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import chromadb

# local chroma
client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection("wiki")

# all-MiniLM on CPU
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# ingest
docs = SimpleDirectoryReader("./wiki").load_data()
vector_store = ChromaVectorStore(chroma_collection=collection)
index = VectorStoreIndex.from_documents(docs, embed_model=embed_model, vector_store=vector_store)

# query
query_engine = index.as_query_engine()
response = query_engine.query("What is our on-call escalation policy?")
print(response)

Ingesting 40k docs took about 22 minutes on a 4-core VM. Queries return in under 200 ms.

Results

Embedding cost: $0 (all-MiniLM runs locally)
Vector DB cost: $0 (Chroma self-hosted)
Orchestration cost: $0 (llama-index is MIT)
Actual monthly bill: $0 (the VM was already provisioned)

The retrieval quality is genuinely good for intra-domain queries. We're seeing ~78% top-3 recall on a 200-question eval set, which beats the paid embedding + Pinecone baseline we tested last year at 74%.

If you're still defaulting to paid embeddings for internal search, try this stack first. The catalog has everything you need, and none of it costs a cent.

I built a production RAG pipeline for $0 — llama-index + chroma + all-MiniLM

I built a production RAG pipeline for $0 — llama-index + chroma + all-MiniLM

The stack

The setup

Results

Comments · 3