Guardrail · Language & NLP · Free
Detoxify
Pretrained PyTorch models that score text for toxicity, threats, insults, and identity attacks — a drop-in content-moderation guardrail.
Detoxify — toxic comment & content moderation classifier
Detoxify provides trained PyTorch models that detect toxic, obscene, threatening, insulting, and identity-attacking language. The models were built for Jigsaw's three toxic comment challenges: Toxic Comment Classification, Unintended Bias in Toxicity Classification, and Multilingual Toxic Comment Classification. It works as a drop-in content-moderation guardrail for user- or model-generated text.
Key features
- Three pretrained checkpoints: `original`, `unbiased` (bias-mitigated), and `multilingual` (7 languages)
- Per-label probability scores for toxicity, severe toxicity, obscene, threat, insult, and identity attack
- One-line inference API built on Hugging Face Transformers and PyTorch Lightning
- The `unbiased` model is trained to reduce unintended identity bias for fairer moderation
- Lightweight enough to run inline as an input/output filter for chat and comment pipelines
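As a rough illustration of the one-line inference API and the per-label scores listed above, here is a minimal sketch. `Detoxify(checkpoint).predict(text)` is the package's own call; the helper names `score_text` and `flag_labels` and the 0.5 threshold are assumptions made for this example, not part of Detoxify.

```python
from typing import Dict, List


def score_text(text: str, checkpoint: str = "original") -> Dict[str, float]:
    """Return per-label probabilities for a single string."""
    # Imported lazily so flag_labels() below works without the package installed.
    from detoxify import Detoxify

    # predict() on a single string returns a dict of label -> score.
    # This sketch reloads the model on every call; a real pipeline would
    # construct Detoxify(checkpoint) once and reuse it.
    raw = Detoxify(checkpoint).predict(text)
    return {label: float(score) for label, score in raw.items()}


def flag_labels(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """List labels whose score meets the threshold, highest score first.

    The 0.5 default is an arbitrary assumption; tune it per label and use case.
    """
    hits = [(label, score) for label, score in scores.items() if score >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in hits]
```

Calling `flag_labels(score_text("some comment"))` would then give the list of labels that crossed the threshold, which is usually more actionable than a single pass/fail flag.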
Detoxify gives you a fast, self-hostable toxicity gate you can place ahead of or behind an LLM, returning granular scores instead of a single opaque yes/no verdict.
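One way to place such a gate both ahead of and behind an LLM is sketched below. Everything here is hypothetical scaffolding: `guarded_reply`, the `generate` callable standing in for an LLM, the `REFUSAL` message, and the single 0.5 threshold are assumptions for illustration. The scorer is passed in as a function, so any callable that maps text to a dict of label scores will fit.

```python
from typing import Callable, Dict

# A scorer maps text to per-label probabilities, e.g. a loaded model's predict().
Scorer = Callable[[str], Dict[str, float]]

# Hypothetical fallback message returned when either check trips.
REFUSAL = "Sorry, I can't help with that."


def is_toxic(text: str, scorer: Scorer, threshold: float) -> bool:
    """True if any label's score meets or exceeds the threshold."""
    return max(scorer(text).values(), default=0.0) >= threshold


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],
    scorer: Scorer,
    threshold: float = 0.5,
) -> str:
    """Screen the prompt, call the model, then screen the model's reply."""
    # Input filter: skip generation entirely for a flagged prompt.
    if is_toxic(prompt, scorer, threshold):
        return REFUSAL
    reply = generate(prompt)
    # Output filter: suppress a flagged reply.
    if is_toxic(reply, scorer, threshold):
        return REFUSAL
    return reply
```

With Detoxify installed, the scorer could be the `predict` method of a model loaded once at startup, for example `Detoxify("unbiased").predict`. Injecting it this way also makes the gate easy to unit-test with a stub scorer.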
Curated mirror of the open-source Detoxify (Apache-2.0). Get it from the source.