Detoxify — toxic comment & content moderation classifier

Detoxify provides trained PyTorch models that detect toxic, obscene, threatening, insulting, and identity-attacking language. The models were built for Jigsaw's three toxic comment challenges. Detoxify works as a drop-in content-moderation guardrail for user- or model-generated text.

Key features

Three pretrained checkpoints: original, unbiased (bias-mitigated), and multilingual (7 languages)
Per-label probability scores for toxicity, severe toxicity, obscene, threat, insult, and identity attack
One-line inference API built on Hugging Face Transformers and PyTorch Lightning
The unbiased model is trained to reduce unintended identity bias for fairer moderation
Lightweight enough to run inline as an input/output filter for chat and comment pipelines
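
As a rough illustration of the inference API and per-label scores listed above, the sketch below wraps the package's `Detoxify(...).predict(...)` call and picks out the labels that cross a cutoff. The helper names `score_text` and `flagged_labels` and the 0.5 threshold are assumptions made for this example, not part of Detoxify itself; the exact label keys returned depend on the checkpoint.

```python
from typing import Dict, List


def score_text(text: str, checkpoint: str = "original") -> Dict[str, float]:
    """Return per-label probabilities from a Detoxify checkpoint."""
    # Imported lazily so flagged_labels() below works without the package installed.
    from detoxify import Detoxify  # pip install detoxify

    # Checkpoint names follow the feature list: "original", "unbiased", "multilingual".
    raw = Detoxify(checkpoint).predict(text)
    # Convert the returned scores to plain Python floats.
    return {label: float(p) for label, p in raw.items()}


def flagged_labels(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Labels whose score meets the threshold, highest score first.

    The 0.5 default is a hypothetical cutoff; tune it on your own data.
    """
    hits = [(label, p) for label, p in scores.items() if p >= threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in hits]
```

Calling `flagged_labels(score_text("some comment"))` would then return only the labels that fired for that comment. The first call downloads the checkpoint weights, so in a service the model is better loaded once and reused.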

Detoxify gives you a fast, self-hostable toxicity gate you can place ahead of or behind an LLM, returning granular scores instead of a single opaque yes/no verdict.
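
One way to place such a gate both ahead of and behind an LLM is sketched below. Everything here is illustrative: `guarded_reply`, `is_toxic`, the `REFUSAL` message, and the 0.8 threshold are hypothetical names and values, and the scorer is passed in as a plain callable so any function that maps text to a label-to-score dictionary can be used.

```python
from typing import Callable, Dict

Scorer = Callable[[str], Dict[str, float]]   # text -> {label: probability}
Generator = Callable[[str], str]             # prompt -> model reply

REFUSAL = "[blocked by toxicity filter]"     # hypothetical placeholder message


def is_toxic(scores: Dict[str, float], threshold: float = 0.8) -> bool:
    """True if any label's score reaches the (assumed) threshold."""
    return max(scores.values(), default=0.0) >= threshold


def guarded_reply(
    prompt: str,
    generate: Generator,
    scorer: Scorer,
    threshold: float = 0.8,
) -> str:
    """Run the scorer on the prompt, then on the reply, blocking either if toxic."""
    # Input filter: skip generation entirely for a toxic prompt.
    if is_toxic(scorer(prompt), threshold):
        return REFUSAL
    reply = generate(prompt)
    # Output filter: check the model's text before returning it.
    if is_toxic(scorer(reply), threshold):
        return REFUSAL
    return reply
```

With the package installed, the scorer could be the bound `predict` method of a model loaded once at startup, for example `Detoxify("unbiased").predict`. Because the scores are granular, the single threshold used here could also be replaced with per-label thresholds.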

Curated mirror of the open-source Detoxify (Apache-2.0). Get it from the source.
