BEIR
Heterogeneous zero-shot information-retrieval benchmark bundling 15+ diverse IR datasets behind one evaluation API.
BEIR (Benchmarking IR) is a widely used benchmark for measuring how well a retriever generalizes zero-shot to domains it was never tuned on. Rather than scoring a model on a single collection, it aggregates 15+ heterogeneous datasets (spanning fact-checking, question answering, biomedical, scientific, financial, duplicate-detection, and news retrieval) into a common format with unified corpus/queries/qrels loaders and evaluation.
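To make the corpus/queries/qrels split concrete, here is a minimal sketch of the three structures. It assumes the layout BEIR datasets are commonly distributed in (a JSONL corpus and query file keyed by `_id`, plus a tab-separated qrels file with `query-id`, `corpus-id`, and `score` columns). The toy records and the `load_*` helper names are hypothetical and are not part of the BEIR library, which ships its own loader.

```python
import csv
import io
import json

# Toy records (hypothetical), written in the assumed on-disk layout.
CORPUS_JSONL = '''{"_id": "d1", "title": "Aspirin", "text": "Aspirin reduces fever."}
{"_id": "d2", "title": "Python", "text": "Python is a programming language."}'''
QUERIES_JSONL = '''{"_id": "q1", "text": "what lowers a fever"}'''
QRELS_TSV = "query-id\tcorpus-id\tscore\nq1\td1\t1\n"


def load_corpus(jsonl):
    """Map doc id -> {"title", "text"}."""
    corpus = {}
    for line in jsonl.splitlines():
        if line.strip():
            row = json.loads(line)
            corpus[row["_id"]] = {"title": row.get("title", ""), "text": row["text"]}
    return corpus


def load_queries(jsonl):
    """Map query id -> query text."""
    queries = {}
    for line in jsonl.splitlines():
        if line.strip():
            row = json.loads(line)
            queries[row["_id"]] = row["text"]
    return queries


def load_qrels(tsv):
    """Map query id -> {doc id: relevance grade}."""
    qrels = {}
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
    for row in reader:
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


corpus = load_corpus(CORPUS_JSONL)
queries = load_queries(QUERIES_JSONL)
qrels = load_qrels(QRELS_TSV)
```

Keeping relevance judgments (qrels) separate from the corpus and queries is what lets one evaluation routine score any retriever on any dataset that follows the same shape.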
Key features
- 15+ ready-to-use retrieval datasets in one consistent schema
- Standardized nDCG@k, MAP, Recall, and Precision evaluation out of the box
- Compare BM25, dense bi-encoders, ColBERT, rerankers, and hybrid systems apples-to-apples
- Focus on zero-shot generalization, exposing where dense models quietly underperform lexical baselines
- Widely cited reference used to report embedding and retriever quality
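For readers unfamiliar with the metrics listed above, the following is a minimal pure-Python sketch of nDCG@k and Recall@k for a single query. It assumes linear gain (the relevance grade divided by log2(rank + 1)); it illustrates the definitions only and is not BEIR's own evaluation code. The function names are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k for one query.

    ranked_ids: doc ids in the order the retriever returned them.
    relevance:  dict of doc id -> graded relevance (missing ids count as 0).
    """
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)  # rank is 0-based
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    # Ideal ordering: all judged documents sorted by descending relevance.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids, relevance, k):
    """Fraction of relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


# Toy example (hypothetical): two relevant docs, retrieved at ranks 2 and 3.
relevance = {"d1": 1, "d3": 1}
ranked = ["d2", "d1", "d3"]
print(round(ndcg_at_k(ranked, relevance, 3), 4))
print(recall_at_k(ranked, relevance, 2))
```

Per-query values like these are typically averaged over all queries in a dataset to produce the reported score.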
Use it to sanity-check a new embedding model or reranker before shipping it into a RAG stack, so you know it holds up beyond your own domain.
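One way to run such a sanity check is to compare per-dataset scores for a candidate model against a baseline and flag every dataset where the candidate falls behind. The sketch below assumes you already have one score per dataset (for example nDCG@10) for each system. The function name, dataset names, and numbers are placeholders for illustration, not measured results.

```python
def find_regressions(baseline, candidate, margin=0.0):
    """Return (dataset, baseline_score, candidate_score) for each dataset
    where the candidate trails the baseline by more than `margin`.

    Only datasets present in both score dicts are compared.
    """
    shared = sorted(set(baseline) & set(candidate))
    return [
        (name, baseline[name], candidate[name])
        for name in shared
        if candidate[name] < baseline[name] - margin
    ]


# Placeholder scores (hypothetical, not real benchmark numbers).
baseline_scores = {"dataset_a": 0.40, "dataset_b": 0.55}
candidate_scores = {"dataset_a": 0.48, "dataset_b": 0.50}

for name, base, cand in find_regressions(baseline_scores, candidate_scores):
    print(f"{name}: candidate {cand:.2f} < baseline {base:.2f}")
```

Looking at per-dataset gaps rather than only the average matters here, because a higher mean can hide a domain where the new model is worse than a lexical baseline.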
Curated mirror of the open-source BEIR (Apache-2.0). Get it from the source.