SciBERT
BERT language model pretrained on 1.14M scientific papers with a domain-specific vocabulary, strong on biomedical NLP tasks.
SciBERT is a BERT language model from the Allen Institute for AI, pretrained on 1.14 million scientific papers from Semantic Scholar. The corpus is mostly biomedical, with the remainder from computer science. The model uses SciVocab, a WordPiece vocabulary built from scientific text, which covers biomedical and technical terminology better than the general-domain BERT vocabulary does.
Key features
- Cased and uncased weights, each available with either the original BERT vocabulary (base-vocab) or SciVocab
- HuggingFace Transformers compatible for easy fine-tuning
- Strong results on scientific NER, PICO extraction, relation classification, dependency parsing, and text classification
- Benchmark configurations for datasets such as BC5CDR, JNLPBA, ChemProt, and SciCite
- AllenNLP configs for reproducing paper results
Load the weights with HuggingFace and fine-tune on your biomedical NLP task, or use the model as a frozen encoder to produce embeddings over scientific and clinical literature. It is a strong, well-cited starting point for entity extraction and classification over biomedical text.
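The frozen-encoder route described above can be sketched as follows. This is a minimal illustration, not the SciBERT authors' recipe: `scibert_model_id` and `embed` are hypothetical helper names, the Hub IDs are assumed to follow the `allenai/scibert_<vocab>_<casing>` pattern, and mean pooling over token states is just one common way to get a fixed-size vector. It assumes `torch` and `transformers` are installed.

```python
from typing import List


def scibert_model_id(vocab: str = "scivocab", cased: bool = False) -> str:
    """Build a Hub model ID for a SciBERT variant.

    Assumption: IDs follow the pattern allenai/scibert_<vocab>_<casing>.
    """
    if vocab not in ("scivocab", "basevocab"):
        raise ValueError(f"unknown vocab: {vocab!r}")
    casing = "cased" if cased else "uncased"
    return f"allenai/scibert_{vocab}_{casing}"


DEFAULT_MODEL_ID = scibert_model_id()


def embed(texts: List[str], model_id: str = DEFAULT_MODEL_ID):
    """Return one mean-pooled vector per text, with SciBERT used as a frozen encoder."""
    # Imported here so the ID helper above works without these packages installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # turn off dropout; no weights are updated

    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():  # frozen encoder: no gradients needed
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Average token states, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).type_as(hidden)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```

Calling `embed(["The BRCA1 gene is associated with breast cancer."])` downloads the weights on first use and returns a tensor with one row per input text. For fine-tuning instead, the same model ID can be passed to a task-specific class such as `AutoModelForTokenClassification` or `AutoModelForSequenceClassification`.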
Curated mirror of the open-source SciBERT (Apache-2.0). Get it from the source.