Databricks Dolly-15k
CC-BY-SA-3.0 instruction dataset of 15k human-written prompts — the first commercially licensed open instruction dataset.
Databricks Dolly-15k is a landmark open instruction-following dataset: roughly 15,000 prompt/response pairs written by Databricks employees, with no distillation from GPT models and no synthetic generation. Released under CC-BY-SA 3.0, it was the first large human-written instruction dataset explicitly licensed for commercial use, and it helped spark the wave of open-source instruction tuning.
Key features
- 15,011 human-authored instruction/response pairs
- 8 capability categories: brainstorming, classification, closed QA, creative writing, general QA, information extraction, open QA, summarization
- 100% human-written: no GPT distillation, so it avoids the usage restrictions attached to data generated by proprietary models
- CC-BY-SA 3.0: commercial use explicitly permitted, subject to attribution and share-alike terms
- Used to train Dolly-v2-12b (weights released under the MIT license) and many community fine-tunes
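The category labels are stored as short identifiers rather than display names. The sketch below tallies rows per category and rejects unexpected labels. It assumes the stored values are the snake_case forms listed in `DOLLY_CATEGORIES`. The helper name `category_counts` and the three sample rows are hypothetical, used only for illustration.

```python
from collections import Counter

# Assumption: the dataset stores categories as these snake_case labels.
DOLLY_CATEGORIES = {
    "brainstorming",
    "classification",
    "closed_qa",
    "creative_writing",
    "general_qa",
    "information_extraction",
    "open_qa",
    "summarization",
}


def category_counts(rows):
    """Count rows per category, raising if a label is not a known one."""
    counts = Counter(row["category"] for row in rows)
    unknown = set(counts) - DOLLY_CATEGORIES
    if unknown:
        raise ValueError(f"unexpected categories: {sorted(unknown)}")
    return counts


# Hypothetical rows standing in for real dataset records.
rows = [
    {"category": "open_qa"},
    {"category": "closed_qa"},
    {"category": "open_qa"},
]
print(category_counts(rows).most_common())
```

The same function should accept a loaded Hugging Face dataset, since iterating over it yields one dict per row.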
Quick start
# Requires the Hugging Face `datasets` package: pip install datasets
from datasets import load_dataset
ds = load_dataset("databricks/databricks-dolly-15k", split="train")
print(f"Rows: {len(ds)}")  # 15011
print(ds[0].keys())  # instruction, context, response, category
# Filter by category
brainstorm = ds.filter(lambda x: x["category"] == "brainstorming")
print(f"Brainstorming rows: {len(brainstorm)}")
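For instruction tuning, each row usually has to be flattened into a single training string. The following is a minimal sketch of one way to do that, relying only on the `instruction`, `context`, and `response` fields shown above. The `### Instruction:` style template and the `format_example` helper are assumptions chosen for illustration, not a format prescribed by the dataset. The sample row is made up.

```python
def format_example(row: dict) -> str:
    """Render one Dolly-style row as a single prompt/response string."""
    parts = [f"### Instruction:\n{row['instruction'].strip()}"]
    # The context field is empty for many rows, so include it only when present.
    context = (row.get("context") or "").strip()
    if context:
        parts.append(f"### Context:\n{context}")
    parts.append(f"### Response:\n{row['response'].strip()}")
    return "\n\n".join(parts)


# Hypothetical row with the same keys as a dataset record.
sample = {
    "instruction": "Name the capital of France.",
    "context": "",
    "response": "Paris.",
    "category": "open_qa",
}
print(format_example(sample))
```

With a loaded dataset, something like `ds.map(lambda row: {"text": format_example(row)})` would add the rendered string as a new `text` column.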
Install via ai-supply
npx ai-supply add databricks-dolly-15k
Curated mirror of the open-source Databricks Dolly-15k (CC-BY-SA-3.0). Get it from the source.