BEIR

BEIR (Benchmarking IR) is a widely used benchmark for measuring how well a retriever generalizes zero-shot to domains it was never tuned on. Rather than evaluating on a single collection, which rewards overfitting, it aggregates 15+ heterogeneous datasets (spanning fact-checking, question answering, biomedical, scientific, financial, duplicate-detection, and news retrieval) into a common format with unified corpus/queries/qrels loaders and evaluation.
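To make the corpus/queries/qrels split concrete, here is a minimal sketch of the three-part layout as plain Python dictionaries. The exact field names ("title", "text"), all IDs and texts, and the check_consistency helper are illustrative assumptions, not part of BEIR itself.

```python
from typing import Dict, List

# Toy data in a corpus/queries/qrels layout. Every ID and text is made up.
# corpus:  doc_id   -> document fields
# queries: query_id -> query text
# qrels:   query_id -> {doc_id: graded relevance judgment}
corpus: Dict[str, Dict[str, str]] = {
    "d1": {"title": "Aspirin", "text": "Aspirin reduces fever and relieves mild pain."},
    "d2": {"title": "Ibuprofen", "text": "Ibuprofen is an anti-inflammatory drug."},
    "d3": {"title": "Bond yields", "text": "Bond yields rise when bond prices fall."},
}
queries: Dict[str, str] = {
    "q1": "which drug lowers fever",
    "q2": "what happens to yields when bond prices drop",
}
qrels: Dict[str, Dict[str, int]] = {
    "q1": {"d1": 1},
    "q2": {"d3": 1},
}


def check_consistency(
    corpus: Dict[str, Dict[str, str]],
    queries: Dict[str, str],
    qrels: Dict[str, Dict[str, int]],
) -> List[str]:
    """Hypothetical helper: list qrels entries that point at unknown IDs."""
    problems = []
    for query_id, judged in qrels.items():
        if query_id not in queries:
            problems.append(f"unknown query id: {query_id}")
        for doc_id in judged:
            if doc_id not in corpus:
                problems.append(f"unknown doc id: {doc_id} (query {query_id})")
    return problems


print(check_consistency(corpus, queries, qrels))
```

Keeping judgments in a separate qrels mapping is what lets one evaluation routine score any retriever: a system only has to return scored document IDs per query ID.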

Key features

15+ ready-to-use retrieval datasets in one consistent schema
Standardized nDCG@k, MAP, Recall, and Precision evaluation out of the box
Compare BM25, dense bi-encoders, ColBERT, rerankers, and hybrid systems apples-to-apples
Focus on zero-shot generalization, exposing where dense models quietly underperform lexical baselines
Widely cited reference used to report embedding and retriever quality
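The metrics named above can be sketched in a few lines of plain Python. This is an illustration only, not BEIR's own evaluation code: it assumes linear gain with a log2 rank discount for nDCG@k, and the function names, document IDs, and scores for the two toy systems are all made up.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[int], k: int) -> float:
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(qrels_q: Dict[str, int], scores_q: Dict[str, float], k: int) -> float:
    """nDCG@k for one query: DCG of the system ranking divided by the ideal DCG."""
    ranked = sorted(scores_q, key=scores_q.get, reverse=True)
    gains = [qrels_q.get(doc_id, 0) for doc_id in ranked]
    ideal = sorted(qrels_q.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(qrels_q: Dict[str, int], scores_q: Dict[str, float], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = {doc_id for doc_id, rel in qrels_q.items() if rel > 0}
    if not relevant:
        return 0.0
    top_k = sorted(scores_q, key=scores_q.get, reverse=True)[:k]
    return len(relevant & set(top_k)) / len(relevant)


# One query with two relevant documents, scored by two hypothetical systems.
qrels_q = {"d1": 1, "d4": 1}
lexical = {"d1": 9.2, "d2": 4.1, "d4": 3.0}    # made-up lexical scores
dense = {"d2": 0.81, "d3": 0.78, "d1": 0.60}   # made-up dense similarity scores

for name, scores in [("lexical", lexical), ("dense", dense)]:
    print(name, round(ndcg_at_k(qrels_q, scores, 3), 4), recall_at_k(qrels_q, scores, 3))
```

Because both systems are scored against the same judgments with the same functions, the comparison stays apples-to-apples even though their raw score scales differ.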

Use it to sanity-check a new embedding model or reranker before shipping it into a RAG stack, so you know it holds up beyond your own domain.

Curated mirror of the open-source BEIR (Apache-2.0). Get it from the source.
