Name: Hugging Face Datasets
Author: ai-supply

Hugging Face Datasets

Datasets is a lightweight library for sharing and accessing datasets for NLP, computer vision, and audio tasks. It stores data in Apache Arrow files and memory-maps them from disk, so it can work with datasets much larger than RAM, and it loads from and publishes to the Hugging Face Hub directly. Evaluation metrics, once bundled with the library, are now provided by the separate Evaluate library.
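
To illustrate the memory-mapping idea in general terms, the following minimal sketch uses only Python's standard mmap module on a temporary file of fixed-width records. It is not how Datasets or Apache Arrow are implemented; the record layout and the helper name read_record are assumptions made up for this illustration. The point is that a single record can be read by offset while the operating system pages in only the bytes that are touched.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout for this sketch: one little-endian int64 per record.
RECORD = struct.Struct("<q")


def read_record(mm, index):
    """Return record `index` by slicing the mapped file; the file is never read in full."""
    start = index * RECORD.size
    return RECORD.unpack(mm[start:start + RECORD.size])[0]


# Write 1,000 records to a temporary file standing in for a large on-disk dataset.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(1000):
        f.write(RECORD.pack(i * i))
    path = f.name

# Map the file read-only and fetch one record by position.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        value = read_record(mm, 123)
        total_records = len(mm) // RECORD.size

os.remove(path)
print(value, total_records)
```

Random access by offset is what lets a tool index into a file far larger than available memory, since the cost of a lookup depends on the record size rather than the file size.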

Key Features

50,000+ datasets on the Hugging Face Hub, loadable in one line
Memory-mapped Arrow format: process 100GB+ datasets without running out of RAM
Streaming mode: iterate over a dataset as it is fetched, without downloading it in full first
Fast data processing: parallelized map, filter, shuffle via multiprocessing
Interoperability: converts to/from Pandas, PyTorch, TensorFlow, JAX
Dataset cards and version control via the Hub
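
The streaming item in the list above relies on lazy iteration: examples are produced one at a time, and the consumer stops as soon as it has enough. A rough standard-library sketch of that idea follows. The helpers stream_examples and take are hypothetical, written for this illustration only, and are not the Datasets API (the library's own streaming call appears in the Quick Start).

```python
from itertools import islice


def stream_examples(n_total, produced):
    """Yield examples lazily; `produced` records how many were actually generated."""
    for i in range(n_total):
        produced.append(i)
        yield {"id": i, "text": f"example {i}"}


def take(iterable, n):
    """Return a lazy iterator over the first `n` items of `iterable`."""
    return islice(iterable, n)


produced = []
first_three = list(take(stream_examples(1_000_000, produced), 3))

# Only three examples were generated, even though the source could yield a million.
print([ex["id"] for ex in first_three], len(produced))
```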

Quick Start

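from datasets import load_dataset
from transformers import AutoTokenizer  # assumption: transformers is installed

# Load from Hub
dataset = load_dataset("squad")
print(dataset["train"][0])

# Stream a huge dataset (C4 is hosted on the Hub as "allenai/c4")
streamed = load_dataset("allenai/c4", "en", split="train", streaming=True)
for example in streamed.take(10):
    print(example["text"][:100])

# Apply transformations
# SQuAD has "question" and "context" columns rather than "text".
# The checkpoint name below is only an example; any tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized = dataset.map(
    lambda x: tokenizer(x["question"], x["context"], truncation=True),
    batched=True,
)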
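
With batched=True, the function passed to map receives a batch as a dictionary that maps each column name to a list of values, and it returns a dictionary of lists. The sketch below mimics that contract in plain Python so it can be run without downloading anything. The function batched_map is a hypothetical stand-in, not the library's implementation, and the toy columns are invented for the example.

```python
def batched_map(columns, fn, batch_size=2):
    """Apply `fn` to column-oriented batches and merge the columns it returns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of column name -> list slice, like a batched map input.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Returned columns are added to (or replace) the original ones.
    return {**columns, **new_columns}


def add_length(batch):
    """Compute one output value per input row in the batch."""
    return {"n_chars": [len(text) for text in batch["text"]]}


toy = {"text": ["hello", "hi", "hey there"], "label": [0, 1, 0]}
result = batched_map(toy, add_length)
print(result["n_chars"])
```

Working on whole lists at a time is what allows a batch-aware function such as a tokenizer to amortize its per-call overhead across many rows.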

Install via ai-supply

npx ai-supply add huggingface-datasets-hub

Curated mirror of the open-source Hugging Face Datasets (Apache-2.0). Get it from the source.
