DVC — Data Version Control

Author: ai-supply

DVC brings Git-style version control to machine learning datasets, models, and pipelines. Define reproducible ML pipelines as code, store large files in remote storage (S3, GCS, Azure, SSH), and track every experiment with lightweight metafiles committed to Git.

Key features

Data versioning — track large files and directories without bloating your Git repo
Pipeline DAGs — define stages in dvc.yaml; DVC caches stage outputs and re-runs only the stages whose dependencies changed
Experiment tracking — dvc exp run to run experiments, dvc exp show to compare them in a table
Remote storage — S3, GCS, Azure Blob, SSH, HDFS, and local remotes
CI/CD integration — dvc repro in GitHub Actions for reproducible ML pipelines
Python API — read DVC-tracked data programmatically from notebooks or scripts
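
The "re-run only what changed" behaviour of pipeline stages can be illustrated with a small sketch. This is not DVC's implementation: it only mirrors the general idea of recording a content hash (DVC records MD5 hashes in dvc.lock) for each dependency and comparing it on the next run. The function names `file_md5`, `snapshot`, and `stage_changed` are hypothetical and exist only for this illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(deps: list[Path]) -> dict[str, str]:
    """Record the current hash of every dependency (a stand-in for a lock file)."""
    return {str(dep): file_md5(dep) for dep in deps}


def stage_changed(deps: list[Path], recorded: dict[str, str]) -> bool:
    """True if any dependency is unrecorded or its hash differs from the record."""
    return any(recorded.get(str(dep)) != file_md5(dep) for dep in deps)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"
        data.write_text("x,y\n1,2\n")
        lock = snapshot([data])
        print(stage_changed([data], lock))  # False: nothing changed, stage can be skipped
        data.write_text("x,y\n1,3\n")
        print(stage_changed([data], lock))  # True: content changed, stage must re-run
```

Hashing content rather than checking timestamps means a file that is rewritten with identical bytes does not trigger a re-run.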

Quick start

npx ai-supply add dvc-ml-pipeline-versioning

# Or install directly
pip install dvc

# Initialize in a Git repo
git init my-project && cd my-project
dvc init

# Track a dataset
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"

# Define a pipeline stage (recorded in dvc.yaml, not executed yet)
dvc stage add -n train \
  -d data/train.csv -d src/train.py \
  -o model.pkl \
  python src/train.py

# Reproduce the pipeline
dvc repro
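
Once data/train.csv is tracked, the Python API listed under key features can read it from a script or notebook. The following is a minimal sketch that assumes DVC is installed and the code runs inside the project; the helper name `load_training_rows` is hypothetical, while `dvc.api.open` is the library call it wraps.

```python
import csv


def load_training_rows(path="data/train.csv", rev=None):
    """Read a DVC-tracked CSV file into a list of dict rows.

    `rev` may be any Git revision (branch, tag, or commit);
    None reads the version in the current workspace.
    """
    # Imported lazily so this module still loads where DVC is not installed.
    import dvc.api

    with dvc.api.open(path, rev=rev) as fh:
        return list(csv.DictReader(fh))
```

Calling `load_training_rows(rev="v1.0")` would read the file as it existed at a Git tag named v1.0, assuming such a tag exists in the repository.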

Curated mirror of the open-source DVC project (Apache-2.0). Install upstream from the repository.
