I built a production RAG pipeline for $0 — llama-index + chroma + all-MiniLM
I built a production RAG pipeline for $0 — llama-index + chroma + all-MiniLM
Six weeks ago my team needed semantic search over our internal wiki — roughly 40,000 markdown documents. My first instinct was to reach for the usual paid stack (OpenAI embeddings + Pinecone). Then a colleague pointed me at ai-supply.store and said "check the catalog first."
Three free listings later, we're in production.
The stack
- llama-index-data-framework — data ingestion, chunking, query engine
- chroma-vector-database — local-first vector store (we run it as a Docker sidecar)
- all-minilm-l6-v2-embeddings — 384-dim sentence embeddings, runs on CPU without breaking a sweat
All three are free to install from the catalog, and all three passed the platform's security scan at grade A. That was actually a decision factor — the Most secure leaderboard made it easy to confirm we weren't pulling anything with hidden egress or dependency CVEs.
The setup
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import chromadb
# local chroma
client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection("wiki")
# all-MiniLM on CPU
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
# ingest
docs = SimpleDirectoryReader("./wiki").load_data()
vector_store = ChromaVectorStore(chroma_collection=collection)
index = VectorStoreIndex.from_documents(docs, embed_model=embed_model, vector_store=vector_store)
# query
query_engine = index.as_query_engine()
response = query_engine.query("What is our on-call escalation policy?")
print(response)
Ingesting 40k docs took about 22 minutes on a 4-core VM. Queries return in under 200 ms.
Results
- Embedding cost: $0 (all-MiniLM runs locally)
- Vector DB cost: $0 (Chroma self-hosted)
- Orchestration cost: $0 (llama-index is MIT)
- Actual monthly bill: $0 (the VM was already provisioned)
The retrieval quality is genuinely good for intra-domain queries. We're seeing ~78% top-3 recall on a 200-question eval set, which beats the paid embedding + Pinecone baseline we tested last year at 74%.
If you're still defaulting to paid embeddings for internal search, try this stack first. The catalog has everything you need, and none of it costs a cent.