DVC
Git-like version control for ML datasets and pipelines — track experiments, reproduce results, and collaborate on data science projects.
DVC — Data Version Control
DVC brings Git-style version control to machine learning datasets, models, and pipelines. Define reproducible ML pipelines as code, store large files in remote storage (S3, GCS, Azure, SSH) instead of in Git, and track every experiment with lightweight metafiles committed to Git.
Key features
- Data versioning: track large files and directories without bloating your Git repo
- Pipeline DAGs: define stages with dvc.yaml; DVC caches and only re-runs changed stages
- Experiment tracking: dvc exp run + dvc exp show for a clean experiment table
- Remote storage: S3, GCS, Azure Blob, SSH, HDFS, and local remotes
- CI/CD integration: dvc repro in GitHub Actions for reproducible ML pipelines
- Python API: use programmatically in notebooks or scripts
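To make the pipeline feature concrete, here is a minimal sketch of what a dvc.yaml with a single stage might contain. The stage name train and the file paths are assumptions borrowed from the quick start, not a required layout; the snippet only writes the file and does not invoke DVC.

```shell
# Sketch: write a one-stage dvc.yaml into the current directory.
# Note: this overwrites any existing dvc.yaml.
# Stage name and paths are illustrative assumptions.
cat > dvc.yaml <<'EOF'
stages:
  train:
    cmd: python src/train.py   # command DVC runs for this stage
    deps:                      # inputs; a change here invalidates the stage
      - data/train.csv
      - src/train.py
    outs:                      # outputs that DVC caches and tracks
      - model.pkl
EOF
```

With a file like this in a DVC-initialized repo, dvc repro would run the train stage only when one of the listed dependencies has changed since the last run.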
Quick start
npx ai-supply add dvc-ml-pipeline-versioning
# Or install directly
pip install dvc
# Initialize in a Git repo
git init my-project && cd my-project
dvc init
# Track a dataset
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"
# Define a pipeline stage (recorded in dvc.yaml; not executed yet)
dvc stage add -n train \
  -d data/train.csv -d src/train.py \
  -o model.pkl \
  python src/train.py
# Reproduce the pipeline
dvc repro
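The quick start stops at dvc repro. As a hedged sketch of two features listed earlier, remote storage and experiment tracking, the commands below are wrapped in a function so that nothing runs until it is called. The function name share_and_experiment, the remote name storage, and the bucket URL are hypothetical placeholders; an S3 remote also assumes DVC was installed with S3 support (for example pip install "dvc[s3]") and that credentials are configured.

```shell
# Sketch only: assumes DVC is installed and the quick-start repo exists.
# "storage" and s3://my-bucket/dvc-store are placeholder values.
share_and_experiment() {
  # Register a default remote; the setting is saved in .dvc/config
  dvc remote add -d storage s3://my-bucket/dvc-store
  git add .dvc/config
  git commit -m "Configure DVC remote"

  # Upload cached data and model files to the remote
  dvc push

  # Run the pipeline as an experiment, then list experiments in a table
  dvc exp run
  dvc exp show
}
```

Calling share_and_experiment from the repository root runs the steps in order. A collaborator who clones the Git repo could then fetch the same data with dvc pull, provided they have access to the remote.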
Curated mirror of the open-source DVC project (Apache-2.0). Install upstream from the repository.