ai-supply.store
Databricks Dolly-15k

An instruction dataset of roughly 15,000 human-written prompt/response pairs, released under CC-BY-SA-3.0 and among the first open instruction datasets licensed for commercial use.

Publisher: @ai-supply · Installs: 84k · Rating: ★ 4.5 · Reviews: 28

Databricks Dolly-15k is an open instruction-following dataset of roughly 15,000 prompt/response pairs written by Databricks employees, with no distillation from GPT models and no synthetic generation. Released under CC-BY-SA 3.0 in April 2023, it was among the first instruction datasets of its size explicitly licensed for commercial use, and it helped kick off the wave of open-source instruction tuning.

Key features

  • About 15,000 human-authored instruction/response pairs (15,011 rows in the Hugging Face release)
  • 8 capability categories: brainstorming, classification, closed QA, creative writing, general QA, information extraction, open QA, summarization
  • 100% human-written: no GPT distillation, so it avoids the usage-term restrictions that apply to datasets generated with OpenAI models
  • CC-BY-SA 3.0: commercial use permitted, subject to the license's attribution and share-alike terms
  • Used to train Dolly-v2-12b (MIT-licensed) and many community fine-tunes
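To make the category feature concrete, here is a minimal sketch of tallying rows per category. The three rows are invented placeholders shaped like the dataset's records (instruction, context, response, category), not actual Dolly-15k entries, and the helper name count_categories is hypothetical. With the real data, passing the loaded split should work the same way, since iterating it yields one dict per row.

```python
from collections import Counter

# Hypothetical sample rows that mimic the dataset's four fields.
# They are illustrative placeholders, not real Dolly-15k records.
sample_rows = [
    {"instruction": "List three uses for a paperclip.", "context": "",
     "response": "Bookmark, zipper pull, SIM tray tool.",
     "category": "brainstorming"},
    {"instruction": "Is a tomato a fruit or a vegetable?", "context": "",
     "response": "Botanically, a fruit.",
     "category": "classification"},
    {"instruction": "Suggest names for a hiking club.", "context": "",
     "response": "Trail Mix, Summit Seekers.",
     "category": "brainstorming"},
]


def count_categories(rows):
    """Return a Counter mapping category name to number of rows."""
    return Counter(row["category"] for row in rows)


category_counts = count_categories(sample_rows)
for name, n in category_counts.most_common():
    print(f"{name}: {n}")
```

A tally like this is a quick way to check how balanced the categories are before sampling a fine-tuning subset.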

Quick start

from datasets import load_dataset  # pip install datasets

# Download the dataset's single "train" split from the Hugging Face Hub
ds = load_dataset("databricks/databricks-dolly-15k", split="train")
print(f"Rows: {len(ds)}")  # 15011
print(list(ds[0].keys()))  # ['instruction', 'context', 'response', 'category']

# Filter by category
brainstorm = ds.filter(lambda x: x["category"] == "brainstorming")
print(f"Brainstorming rows: {len(brainstorm)}")
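Once rows are loaded, a common next step for instruction tuning is rendering each one as a single training string. The sketch below shows one way to do that. The "### Instruction / ### Context / ### Response" layout and the function name format_example are assumptions chosen for illustration, not a format prescribed by the dataset, and the two sample rows are invented placeholders.

```python
def format_example(row):
    """Render one row as a single prompt/response training string.

    The section headers are a hypothetical template, not an official format.
    """
    parts = [f"### Instruction:\n{row['instruction'].strip()}"]
    # The context field can be empty, so include that section only when present.
    context = (row.get("context") or "").strip()
    if context:
        parts.append(f"### Context:\n{context}")
    parts.append(f"### Response:\n{row['response'].strip()}")
    return "\n\n".join(parts)


# Hypothetical rows for illustration only (not real Dolly-15k records).
with_context = {
    "instruction": "What year is mentioned in the passage?",
    "context": "The bridge opened to traffic in 1937.",
    "response": "1937.",
    "category": "closed_qa",
}
without_context = {
    "instruction": "Suggest a name for a bakery.",
    "context": "",
    "response": "Rise and Shine.",
    "category": "brainstorming",
}

text_a = format_example(with_context)
text_b = format_example(without_context)
print(text_a)
print("---")
print(text_b)
```

With the Hugging Face split from the quick start, something like ds.map(lambda r: {"text": format_example(r)}) should add the rendered string as a new column, though the exact template should match whatever the target model expects.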

Install via ai-supply

npx ai-supply add databricks-dolly-15k

This listing is a curated mirror of the open-source Databricks Dolly-15k dataset (CC-BY-SA-3.0); the original is available from the source repository.
