△EvalLanguage & NLPFree
TruthfulQA
Benchmark of 817 questions across 38 categories measuring whether LLMs avoid imitative falsehoods — a direct probe of factual hallucination.
TruthfulQA — benchmark for truthfulness & imitative falsehoods
TruthfulQA measures whether a language model gives truthful answers to questions that humans often answer falsely because of common misconceptions — a direct probe of imitative hallucination rather than raw knowledge.
Key features
- 817 questions across 38 categories (health, law, finance, politics, conspiracies, and more) crafted to elicit false beliefs
- Two evaluation tracks: generation (free-form) and multiple-choice (MC1 / MC2)
- Truthfulness and informativeness scoring via the "GPT-judge" protocol and BLEURT/reference metrics
- Curated reference true/false answer sets for automated grading
- A widely cited yardstick for catching factuality regressions when fine-tuning or prompting
Because the questions are adversarially chosen to trap models that parrot popular falsehoods, TruthfulQA is a sharp regression test for hallucination — ideal for gating a model release on measured truthfulness.
Curated mirror of the open-source TruthfulQA (Apache-2.0). Get it from the source.