Dataset | Language & NLP | Free
Hugging Face Datasets
Fast, memory-mapped dataset library for NLP and ML — 50,000+ datasets, streaming, and Arrow-backed processing.
Installs: 680k
Rating: ★ 4.8
Reviews: 227
Datasets is a lightweight library for sharing and accessing datasets for NLP, computer vision, and audio tasks (evaluation metrics have since moved to the separate Evaluate library). It stores data in Apache Arrow's memory-mapped format, which lets it work with datasets much larger than RAM, and it integrates with the Hugging Face Hub.
Key Features
- 50,000+ datasets on the Hugging Face Hub, loadable in one line
- Memory-mapped Arrow format: process 100GB+ datasets without running out of RAM
- Streaming mode: iterate over datasets without downloading them
- Fast data processing: map, filter, and shuffle, with map and filter parallelizable via multiprocessing
- Interoperability: converts to/from Pandas, PyTorch, TensorFlow, JAX
- Dataset cards and version control via the Hub
Quick Start
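from datasets import load_dataset
from transformers import AutoTokenizer

# Load from Hub
dataset = load_dataset("squad")
print(dataset["train"][0])

# Stream a huge dataset without downloading it first
streamed = load_dataset("allenai/c4", "en", split="train", streaming=True)
for example in streamed.take(10):
    print(example["text"][:100])

# Apply transformations (the checkpoint name is an illustrative assumption)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# SQuAD has "question" and "context" columns rather than a "text" column
tokenized = dataset.map(lambda x: tokenizer(x["question"], x["context"], truncation=True), batched=True)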
Install via ai-supply
npx ai-supply add huggingface-datasets-hub
Curated mirror of the open-source Hugging Face Datasets (Apache-2.0). Get it from the source.