SciBERT

SciBERT is a BERT language model from the Allen Institute for AI, pretrained on 1.14 million scientific papers from Semantic Scholar. Most of the corpus is biomedical; the remainder is computer science. It uses SciVocab, a WordPiece vocabulary built from scientific text, which covers biomedical and technical terminology substantially better than the general-domain BERT vocabulary.

Key features

Cased and uncased weights, plus base-vocab and SciVocab variants
Compatible with Hugging Face Transformers for easy fine-tuning
Strong results on scientific NER, PICO extraction, relation classification, dependency parsing, and text classification
Benchmark configurations for datasets such as BC5CDR, JNLPBA, ChemProt, and SciCite
AllenNLP configs for reproducing paper results

Load the weights with Hugging Face Transformers and fine-tune on your biomedical NLP task, or use the model as a frozen encoder to produce embeddings over scientific and clinical literature. It is a strong, well-cited starting point for entity extraction and classification over biomedical text.
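As a rough illustration of the frozen-encoder route, the sketch below loads the uncased SciVocab checkpoint through Hugging Face Transformers and returns one vector per input text. It assumes the transformers and torch packages are installed and that the checkpoint is published on the Hugging Face Hub as allenai/scibert_scivocab_uncased. The helper names masked_mean and embed are hypothetical, and mean pooling over non-padding tokens is one common pooling choice made here for illustration, not something this listing prescribes.

```python
# Hub ID assumed for the uncased SciVocab checkpoint.
MODEL_ID = "allenai/scibert_scivocab_uncased"


def masked_mean(token_vectors, mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    # zip(*kept) walks the vectors dimension by dimension.
    return [sum(dim) / len(kept) for dim in zip(*kept)]


def embed(texts, model_id=MODEL_ID):
    """Return one mean-pooled vector per text, with the encoder frozen."""
    # Imported lazily so masked_mean stays usable without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout

    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=512,  # BERT-style position limit
        return_tensors="pt",
    )
    with torch.no_grad():  # frozen encoder: no gradients are tracked
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Pool each sequence over its real (non-padding) tokens.
    return [
        masked_mean(vectors, mask)
        for vectors, mask in zip(
            hidden.tolist(), batch["attention_mask"].tolist()
        )
    ]
```

Calling embed(["Aspirin inhibits cyclooxygenase."]) downloads the weights on first use and returns a list containing one embedding. For fine-tuning rather than frozen embeddings, the same checkpoint ID can be passed to a task-specific class such as AutoModelForTokenClassification instead of AutoModel.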

Curated mirror of the open-source SciBERT (Apache-2.0). Get it from the source.
