# Git for ML
Version control for ML projects is harder than for typical software because ML projects produce large binary artifacts (model checkpoints, datasets, embeddings) that do not belong in Git, and because experiments branch in ways that do not follow the usual feature-branch workflow. This chapter covers experiment branching strategies, large file management with Git LFS and DVC, proper .gitignore configuration, and Git workflows adapted for ML research.
## Branching for Experiments
The key insight for ML experiment management: each experiment should be a branch, and the commit message should include the result. This turns `git log` into an experiment logbook:
```bash
# ── One branch per experiment ──
git checkout -b exp/lr-sweep
# Make changes to config, train, evaluate
git add config.yaml results/
git commit -m "exp: lr sweep [1e-5, 3e-4, 1e-3], best=3e-4, val_loss=0.42"

# ── Compare experiments ──
git diff exp/baseline..exp/lr-sweep -- config.yaml   # What changed?
git log --oneline --graph exp/baseline exp/lr-sweep  # Visual comparison

# ── View all experiments ──
git branch --list "exp/*"              # List experiment branches
git log --oneline --all --grep="exp:"  # All experiment commits

# ── Tag successful experiments ──
git tag -a v1.0-best -m "Best model: val_loss=0.42, lr=3e-4"
```
| Pattern | Purpose | Example |
|---|---|---|
| `exp/<description>` | Experiment branch | `exp/lr-sweep`, `exp/larger-model` |
| `feat/<description>` | New feature | `feat/flash-attention`, `feat/data-pipeline` |
| `fix/<description>` | Bug fix | `fix/nan-loss`, `fix/dataloader-oom` |
| `refactor/<description>` | Code cleanup | `refactor/training-loop` |
| `data/<description>` | Data processing changes | `data/add-validation-split` |
A consistent commit message format keeps experiment results searchable:

```text
exp: <what you tried> -> <key metric>=<value>

exp: lr=3e-4, bs=64, warmup=1000 -> val_loss=0.42, ppl=12.3
exp: add dropout=0.1 to attention -> val_loss=0.39 (3% improvement)
exp: switch to BF16 training -> same loss, 1.8x faster, 40% less memory
```
This makes it trivial to search experiment history: `git log --grep="val_loss" --oneline`
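Because the convention fixes the message shape, the experiment log can also be mined programmatically. A minimal sketch in plain Python (no git calls; the commit subjects below are hypothetical examples of the format above):

```python
import re

# key=value pairs in the result half of an "exp:" commit subject
METRIC_RE = re.compile(r"(\w+)=([0-9.eE+-]+)")

def parse_exp_commit(subject):
    """Parse 'exp: <change> -> <metrics>' into a dict, or None for non-experiments."""
    if not subject.startswith("exp:"):
        return None
    change, _, result = subject[len("exp:"):].partition("->")
    metrics = {k: float(v) for k, v in METRIC_RE.findall(result)}
    return {"change": change.strip(), "metrics": metrics}

# Hypothetical `git log --format=%s` output following the convention
log = [
    "exp: lr=3e-4, bs=64, warmup=1000 -> val_loss=0.42, ppl=12.3",
    "fix: handle NaN in loss computation",
    "exp: add dropout=0.1 to attention -> val_loss=0.39",
]
experiments = [e for e in map(parse_exp_commit, log) if e]
best = min(experiments, key=lambda e: e["metrics"]["val_loss"])
```

Piping `git log --format=%s` into a parser like this gives a quick experiment leaderboard with no extra tooling.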
## Tracking Large Files
ML projects generate large binary files -- model checkpoints (hundreds of MB to hundreds of GB), datasets, and embeddings. These must not go in regular Git (it stores full copies of every version, bloating the repository):
```bash
# ── Setup Git LFS ──
git lfs install                 # One-time setup per user

# ── Track file patterns ──
git lfs track "*.pt"            # PyTorch model files
git lfs track "*.bin"           # Binary weight files
git lfs track "*.safetensors"   # SafeTensors format
git lfs track "*.onnx"          # ONNX models
git lfs track "data/*.parquet"  # Data files
git add .gitattributes          # LFS config stored here
git commit -m "Configure git-lfs tracking"

# ── Use normally -- LFS handles the rest ──
git add model.pt                # Stored via LFS (pointer in Git, file on LFS server)
git commit -m "Add trained model checkpoint"
git push                        # Uploads to LFS server

# ── Inspect LFS state ──
git lfs ls-files                # List tracked files
git lfs status                  # Show pending LFS operations
git lfs fetch --all             # Download all LFS files (useful after clone)
```
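For context, what Git actually commits for an LFS-tracked file is a small text pointer; the real content lives on the LFS server. The pointer stored in place of `model.pt` looks like this (the `oid` hash and `size` values here are illustrative):

```text
version https://git-lfs.github.com/spec/v1
oid sha256:98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
size 4194304
```

This is why cloning without LFS installed yields tiny pointer files instead of weights; `git lfs pull` fetches the real content.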
| Tool | What It Stores | Where Files Live | Git Integration | Best For |
|---|---|---|---|---|
| Git LFS | Pointers in Git, files on LFS server | GitHub LFS, S3, or custom server | Transparent (just git push) | Model files, small datasets |
| DVC | Metadata in Git, files on remote storage | S3, GCS, Azure, SSH, local | Separate commands (dvc push) | Large datasets, pipelines |
| Weights & Biases | Artifacts and metadata | W&B cloud | API-based | Experiment artifacts |
| Hugging Face Hub | Models and datasets | HF Hub | Git-based (via huggingface-cli) | Sharing pretrained models |
## DVC (Data Version Control)
DVC extends Git with data and pipeline versioning. It stores small metadata files (.dvc) in Git while keeping the actual data on remote storage:
```bash
pip install dvc dvc-s3          # Install DVC with S3 backend

# ── Initialize DVC ──
dvc init
dvc remote add -d storage s3://my-bucket/dvc-store
git add .dvc .dvcignore
git commit -m "Initialize DVC"

# ── Track a dataset ──
dvc add data/train.parquet      # Creates data/train.parquet.dvc (metadata)
git add data/train.parquet.dvc data/.gitignore
git commit -m "data: add training dataset v1"
dvc push                        # Upload actual data to S3

# ── Reproduce pipeline ──
dvc repro                       # Re-run only changed stages

# ── Switch between data versions ──
git checkout v1.0               # Switch to old code
dvc checkout                    # Restore matching data version
```
```yaml
# dvc.yaml -- define a reproducible pipeline (DAG of stages)
stages:
  preprocess:
    cmd: python preprocess.py --input data/raw/ --output data/processed/
    deps:                # If any dep changes, the stage re-runs
      - data/raw/
      - preprocess.py
    outs:                # Outputs tracked by DVC
      - data/processed/
  train:
    cmd: python train.py --config config.yaml
    deps:
      - data/processed/
      - train.py
      - config.yaml
    outs:
      - models/best.pt
    metrics:             # Tracked but not stored in the DVC cache
      - metrics.json:
          cache: false
    plots:               # Visualization data
      - plots/loss_curve.csv:
          cache: false
  evaluate:
    cmd: python evaluate.py --model models/best.pt --data data/test/
    deps:
      - models/best.pt
      - evaluate.py
      - data/test/
    metrics:
      - eval_metrics.json:
          cache: false
```
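Conceptually, `dvc repro` decides whether a stage is stale by comparing content hashes of its dependencies against those recorded in `dvc.lock`. A simplified sketch of that idea (not DVC's actual implementation; the file contents are hypothetical):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Stand-in for checksumming a dependency (DVC records MD5s in dvc.lock)."""
    return hashlib.md5(data).hexdigest()

def stage_is_stale(deps, lock):
    """A stage re-runs iff any dependency's hash differs from the recorded one."""
    return any(lock.get(name) != content_hash(data) for name, data in deps.items())

# Hypothetical dependency contents for the 'train' stage
deps = {"train.py": b"print('train v2')", "config.yaml": b"lr: 3e-4"}
# Hashes recorded when the stage last ran (train.py has since changed)
lock = {"train.py": content_hash(b"print('train v1')"),
        "config.yaml": content_hash(b"lr: 3e-4")}

stale = stage_is_stale(deps, lock)  # True -> 'train' would re-run
```

Because the check is content-based rather than timestamp-based, re-running a preprocessing script that produces byte-identical output does not invalidate downstream stages.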
## .gitignore for ML Projects
A proper .gitignore prevents accidentally committing large files, credentials, or environment-specific artifacts:
```gitignore
# ── Data (use DVC or LFS for these) ──
data/raw/
data/processed/
*.parquet
*.csv
*.jsonl
*.h5
*.hdf5
*.tfrecord

# ── Models (use DVC or LFS) ──
*.pt
*.pth
*.bin
*.onnx
*.safetensors
checkpoints/
outputs/

# ── Logs and experiment tracking ──
wandb/
runs/
logs/
tb_logs/
*.log
mlruns/

# ── Python ──
.venv/
__pycache__/
*.pyc
*.pyo
*.egg-info/
dist/
build/
.eggs/

# ── Jupyter ──
.ipynb_checkpoints/

# ── Secrets (NEVER commit these) ──
.env
*.key
*.pem
credentials.json
secrets.yaml

# ── System ──
.DS_Store
Thumbs.db
*.swp
*.swo
*~

# ── IDE ──
.vscode/settings.json
.idea/
*.code-workspace
```
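To sanity-check which paths a pattern list would catch before committing, gitignore-style matching can be approximated with Python's `fnmatch` (a rough sketch only: real gitignore semantics such as negation and `**` are richer):

```python
from fnmatch import fnmatch

def ignored(path, patterns):
    """Rough gitignore-style check: match the filename or any parent directory."""
    parts = path.split("/")
    for pat in patterns:
        if pat.endswith("/"):                      # directory pattern, e.g. "wandb/"
            if pat.rstrip("/") in parts[:-1]:
                return True
        elif any(fnmatch(p, pat) for p in parts):  # file pattern, e.g. "*.pt"
            return True
    return False

patterns = ["*.pt", "wandb/", ".env"]
candidates = ["checkpoints/model.pt", "train.py", "wandb/run-1/output.log"]
caught = [p for p in candidates if ignored(p, patterns)]
# catches the checkpoint and the wandb log, leaves train.py alone
```

The authoritative check is still `git check-ignore -v <path>`, which reports exactly which pattern matched.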
## Useful Git Workflows
```bash
# ── Stash changes while switching context ──
git stash                      # Save uncommitted changes
git checkout main
git checkout -b fix/nan-loss   # Fix a bug
git add . && git commit -m "fix: handle NaN in loss computation"
git checkout exp/my-experiment
git stash pop                  # Restore changes

# ── Cherry-pick a fix into your experiment ──
git cherry-pick abc123         # Apply a specific commit to the current branch

# ── Find which commit broke training ──
git bisect start
git bisect bad                 # Current commit is broken
git bisect good v1.0           # This tag was working
# Git checks out a midpoint commit; test and mark good/bad
# Repeat until the breaking commit is found
git bisect reset               # Return to original state

# ── See what changed in a file over time ──
git log --oneline -p -- train.py   # All changes to train.py
git blame train.py                 # Who changed each line and when

# ── Clean up experiment branches ──
git branch --merged main | grep "exp/" | xargs git branch -d
```
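`git bisect` above is a binary search over the commit range, so it needs only O(log n) test runs. The idea in plain Python (commits stand in as a list; `is_good` plays the role of your training smoke test; all names here are hypothetical):

```python
def bisect_first_bad(commits, is_good):
    """Binary search for the first bad commit.
    Precondition: commits[0] is good and commits[-1] is bad."""
    lo, hi = 0, len(commits) - 1
    steps = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        steps += 1
        if is_good(commits[mid]):
            lo = mid              # breakage is after mid
        else:
            hi = mid              # mid is already broken
    return commits[hi], steps     # first bad commit, tests run

# Hypothetical history: training silently breaks at commit "d4"
history = ["a1", "b2", "c3", "d4", "e5", "f6"]
first_bad, steps = bisect_first_bad(history, is_good=lambda c: c < "d4")
```

With 1000 commits this is about 10 test runs instead of 1000, which is why bisect pays off even when reproducing a training regression is slow.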
| Practice | Why |
|---|---|
| Commit early and often | Disk is cheap; lost work is expensive |
| Include results in commit messages | `git log` becomes an experiment logbook |
| Tag successful experiments | Easy to return to known-good states |
| Never commit large files without LFS/DVC | Repository bloat is permanent and painful to fix |
| Keep config separate from code | Change experiments without modifying Python files |
| Use .gitignore from day one | Prevention is 100x easier than cleanup |
| Sign your commits (`git commit -S`) | Required by some organizations; good practice |
Use Git for code versioning, and a tracking tool for metrics, hyperparameters, and artifacts. They are complementary, not competing.