
Linux and Shell

ML engineering happens on Linux servers. Whether you are SSHing into a GPU cluster, managing training jobs with SLURM, or debugging a crashed process at 2 AM, fluency with the shell is a force multiplier. This chapter covers the essential commands, environment management, remote work patterns, GPU management, and job scheduling that ML engineers use daily.

Essential Commands


# ── File operations ──
ls -lah # List files with permissions, sizes, dates
du -sh * | sort -rh | head -20 # Top 20 entries (files and directories) by size
df -h # Free disk space per filesystem
find . -name "*.pt" -size +1G # Find model files larger than 1 GB
find . -name "*.py" -mtime -1 # Python files modified in last 24 hours
wc -l src/**/*.py # Count lines across all Python files (bash: run `shopt -s globstar` first)

# ── Process management ──
ps aux | grep python # Find running Python processes
top -u $USER # Monitor your processes (CPU, memory)
htop # Interactive process viewer (better than top)
kill -SIGTERM PID # Graceful shutdown (allows cleanup)
kill -SIGKILL PID # Force kill (no cleanup, last resort)
nohup python train.py & # Run in background, survive logout
disown %1 # Detach job 1 from current shell

# ── Text processing ──
grep -rn "learning_rate" . # Search recursively with line numbers
grep -rn "lr" --include="*.py" . # Search only Python files
head -n 100 log.txt # First 100 lines
tail -f log.txt # Follow log output in real time
tail -n +1000 log.txt | head -100 # Lines 1000-1099
python -m json.tool --json-lines file.jsonl # Pretty-print JSON Lines (Python 3.8+, no jq needed)
sort -t',' -k2,2n results.csv # Sort CSV by 2nd column numerically
awk -F',' '{sum+=$2} END {print sum/NR}' results.csv # Average of column 2 (assumes no header row)

# ── Disk cleanup for ML ──
find . -name "__pycache__" -type d -exec rm -rf {} + # Remove pycache
find . -name "*.pyc" -delete # Remove compiled Python files
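
The `awk` average in the text-processing commands counts every row, so a header line would skew the result. A minimal sketch of a header-aware variant follows; the function name `csv_mean` and the demo file are hypothetical, and the parsing assumes a simple CSV with no quoted commas.

```shell
# csv_mean FILE COLUMN: mean of a numeric CSV column, skipping the header row.
# Hypothetical helper; assumes plain comma-separated values (no quoted commas).
csv_mean() {
    local file=$1 col=$2
    awk -F',' -v c="$col" '
        NR == 1  { next }                 # skip the header line
        $c != "" { sum += $c; n++ }       # ignore empty cells
        END { if (n) printf "%.4f\n", sum / n; else print "nan" }
    ' "$file"
}

# Demo on a throwaway file (illustrative values only)
demo_csv=$(mktemp)
printf 'run,accuracy\na,0.5\nb,1.0\n' > "$demo_csv"
csv_mean "$demo_csv" 2
rm -f "$demo_csv"
```

Dividing by a counter instead of `NR` is what keeps the header and any empty cells out of the denominator.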

| Shortcut | Action |
|---|---|
| Ctrl+R | Search command history (reverse-i-search) |
| Ctrl+A / Ctrl+E | Jump to beginning / end of line |
| Ctrl+W | Delete word before cursor |
| Ctrl+U | Delete from cursor to beginning of line |
| Ctrl+L | Clear screen (same as `clear`) |
| Alt+. | Insert last argument from previous command |
| `!!` | Repeat last command (`sudo !!` is common) |
| `!$` | Last argument of previous command |

A training script died, but `nvidia-smi` still shows GPU 0 at 40 GB used and 0% utilization: a leftover process from the dead run is still holding the memory. Chain the process tools to find and stop it.
# 1. Confirm the GPU is occupied but idle
nvidia-smi # GPU 0: 40 GB used, 0% util

# 2. Find which processes hold the GPU device files
fuser -v /dev/nvidia* # prints PIDs touching the GPU

# 3. Confirm they are the dead run, not someone else's job
ps aux | grep python # match PID, user, and command

# 4. Stop them gracefully first, then forcefully if needed
kill -SIGTERM 48213 # ask the process to clean up
kill -SIGKILL 48213 # only if it ignores SIGTERM

# 5. Verify the memory is freed
nvidia-smi # GPU 0 should now read 0 MB used

If memory is still held after the process is gone (a rare driver-level hang), reset the single device with `nvidia-smi --gpu-reset -i 0`. The reset requires root and fails unless the GPU is idle, so confirm no one else is running on it first.
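
The "SIGTERM first, SIGKILL only if ignored" step can be wrapped in a small helper so the grace period is explicit. This is a sketch under stated assumptions: the function name `stop_gracefully` and the 10-second default are illustrative, not part of any standard tool.

```shell
# stop_gracefully PID [TIMEOUT]: send SIGTERM, poll for up to TIMEOUT seconds,
# then fall back to SIGKILL. Hypothetical helper; the default timeout is an assumption.
stop_gracefully() {
    local pid=$1 timeout=${2:-10} waited=0
    kill -TERM "$pid" 2>/dev/null || return 0     # already gone (or not ours to signal)
    while kill -0 "$pid" 2>/dev/null; do          # kill -0 sends nothing; it only tests existence
        if [ "$waited" -ge "$timeout" ]; then
            echo "PID $pid ignored SIGTERM, sending SIGKILL" >&2
            kill -KILL "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Usage (PID taken from the fuser / ps output):
#   stop_gracefully 48213 30
```

Re-run `nvidia-smi` afterwards to confirm the memory was released.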

Your training job crashes with `No space left on device` mid-checkpoint. The goal is to locate the largest offenders, confirm they are safe to delete, then reclaim space without touching live data. Each step below builds on the one before it, with less guidance as you go.
# 1. Which filesystem is full?
df -h # find the partition at ~100%

# 2. Within your project, which directories dominate? (top 20 by size)
du -sh * | sort -rh | head -20

# 3. Old checkpoints are usually the culprit. List large model files,
# confirm before deleting:
find . -name "*.pt" -size +1G

# 4. Reclaim cheap, regenerable space first:
find . -name "__pycache__" -type d -exec rm -rf {} +

Delete stale checkpoints last and only after confirming a newer one exists, since they are the one artifact you cannot regenerate cheaply.
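
To make "confirm before deleting" concrete, the sketch below lists every checkpoint except the newest few and deletes nothing by itself. The function name `list_stale_checkpoints` and the default of keeping three files are assumptions; it relies on GNU `find` and on file names without newlines.

```shell
# list_stale_checkpoints DIR [KEEP]: print every *.pt file under DIR except the
# KEEP most recently modified ones. It only prints; nothing is deleted.
# Hypothetical helper; requires GNU find (-printf).
list_stale_checkpoints() {
    local dir=$1 keep=${2:-3}
    find "$dir" -type f -name '*.pt' -printf '%T@\t%p\n' \
        | sort -rn \
        | tail -n +"$((keep + 1))" \
        | cut -f2-
}

# Usage: review the list first, then delete in a separate, deliberate step.
#   list_stale_checkpoints ./checkpoints 2
#   list_stale_checkpoints ./checkpoints 2 | xargs -r -d '\n' rm --
```

Sorting by modification time newest-first and skipping the first `KEEP` lines guarantees the latest checkpoints never appear in the deletion list.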

Environment Management

Reproducible environments are essential for ML. A model trained in one environment should produce the same results in another:


# ── Conda (recommended for ML: handles CUDA, cuDNN, NCCL) ──
conda create -n ml python=3.11
conda activate ml
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
conda env export > environment.yml # Full export (platform-specific)
conda env export --from-history > env.yml # Only explicitly installed packages
conda env create -f environment.yml # Recreate environment

# ── pip with venv (lightweight, no CUDA management) ──
python -m venv .venv
source .venv/bin/activate
pip install torch
pip freeze > requirements.txt # Exact versions
pip install -r requirements.txt # Recreate

# ── UV (fast Rust-based pip replacement, 10-100x faster) ──
uv venv
source .venv/bin/activate
uv pip install torch # Fast install
uv pip compile requirements.in -o requirements.txt # Lock versions
uv pip sync requirements.txt # Install exactly what's locked

# ── Docker (full reproducibility including OS, CUDA driver) ──
# Use NVIDIA's base images for GPU support
# FROM nvcr.io/nvidia/pytorch:24.01-py3

| Tool | Manages CUDA/cuDNN | Speed | Reproducibility | Best For |
|---|---|---|---|---|
| conda | Yes (from channel) | Slow (minutes) | Platform-specific export | GPU ML development |
| pip + venv | No (system CUDA) | Medium | requirements.txt | Simple projects |
| uv | No (system CUDA) | Very fast (seconds) | Lock file | Modern Python projects |
| Docker | Yes (NVIDIA base image) | Slow (build), fast (pull) | Complete | Deployment, cluster |
| pixi | Yes (conda-forge) | Fast (Rust) | Lock file | Cross-platform ML |

**Environment tips:**

- **Pin exact versions** in `requirements.txt` for reproducibility: `torch==2.2.0`, not `torch>=2.0`.
- **Use `pip install -e .`** for editable installs during development: changes to your code take effect immediately without reinstalling.
- **Keep CUDA versions consistent**: `nvidia-smi` shows the highest CUDA version the driver supports, `nvcc --version` shows the installed toolkit, and `torch.version.cuda` shows the version PyTorch was built against. Prebuilt PyTorch binaries ship their own CUDA runtime, so the driver must support that version; the local toolkit has to match only when you compile custom CUDA extensions.
- **Export minimal specs**: `conda env export --from-history` or a hand-curated `requirements.txt` with only direct dependencies.
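
The pinning tip is easy to check mechanically. A minimal sketch, assuming a pip-style requirements file; the function name `unpinned_requirements` is hypothetical, and the check treats anything without `==` as unpinned.

```shell
# unpinned_requirements FILE: print requirement lines that are not pinned with '=='.
# Hypothetical helper; comments, blank lines, and pip options (lines starting
# with '-') are skipped.
unpinned_requirements() {
    grep -vE '^[[:space:]]*(#|-|$)' "$1" | grep -v '==' || true
}

# Usage:
#   unpinned_requirements requirements.txt
# Any line it prints (for example a bare package name or a '>=' range) is a
# candidate for an exact pin.
```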

SSH and Remote Work

ML training happens on remote servers. Efficient remote work requires persistent sessions, port forwarding, and fast file transfer:


# ── SSH config (~/.ssh/config) for convenience ──
Host gpu-server
    HostName 192.168.1.100
    User alice
    IdentityFile ~/.ssh/id_ed25519
    # Forward the local SSH agent (for git); enable only on hosts you trust
    ForwardAgent yes
    # TensorBoard
    LocalForward 6006 localhost:6006
    # Jupyter
    LocalForward 8888 localhost:8888
    # Send a keepalive every 60 s; drop the connection after 5 missed replies
    ServerAliveInterval 60
    ServerAliveCountMax 5

Host gpu-cluster-*
    # Jump through the bastion host
    ProxyJump bastion
    User alice

# Now just: ssh gpu-server

# ── File transfer ──
scp model.pt gpu-server:~/models/ # Copy single file
rsync -avz --progress ./data/ gpu-server:~/data/ # Sync directory (incremental)
rsync -avz --exclude='__pycache__' --exclude='.git' \
    ./project/ gpu-server:~/project/ # Sync code, skip junk

# ── Persistent sessions with tmux ──
tmux new -s train # Create named session
tmux attach -t train # Reattach after disconnect
tmux ls # List sessions
# Ctrl+B, D -> detach (session keeps running)
# Ctrl+B, [ -> scroll mode (q to exit)
# Ctrl+B, % -> split pane vertically (panes side by side)
# Ctrl+B, " -> split pane horizontally (panes stacked)
# Ctrl+B, z -> zoom/unzoom current pane

| Pattern | Purpose |
|---|---|
| `tmux new -s train`, then `python train.py` | Training survives SSH disconnect |
| Split pane: train + `watch nvidia-smi` | Monitor GPU while training |
| Split pane: train + `tail -f log.txt` | Monitor logs while training |
| Multiple windows: code, train, eval | Organize different tasks |
| `tmux send-keys -t train "python eval.py" Enter` | Send command to running session from outside |

GPU Management


# ── Check GPU status ──
nvidia-smi # Snapshot of all GPUs
nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free,\
utilization.gpu,temperature.gpu --format=csv # Structured output

# ── Continuous monitoring ──
watch -n 1 nvidia-smi # Update every second
nvidia-smi dmon -s u -d 1 # GPU utilization every second
nvitop # Interactive GPU monitor (pip install nvitop)

# ── Control GPU visibility ──
export CUDA_VISIBLE_DEVICES=0,1 # Only GPUs 0 and 1 visible
CUDA_VISIBLE_DEVICES=2 python train.py # Single-command override
CUDA_VISIBLE_DEVICES="" python test.py # Force CPU only

# ── Check CUDA/driver compatibility ──
nvidia-smi # Shows max supported CUDA version
nvcc --version # Shows installed toolkit version
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

# ── Reset GPU (if stuck) ──
nvidia-smi --gpu-reset -i 0 # Reset GPU 0 (needs root; fails if GPU is busy)
fuser -v /dev/nvidia* # Find processes using GPUs

**Important GPU environment variables:**

| Variable | Purpose | Example |
|---|---|---|
| `CUDA_VISIBLE_DEVICES` | Restrict visible GPUs | `0,1,2,3` |
| `NCCL_DEBUG` | NCCL logging level | `INFO`, `WARN` |
| `NCCL_P2P_DISABLE` | Disable P2P (PCIe issues) | `1` |
| `TORCH_CUDA_ARCH_LIST` | Target GPU architectures for compilation | `8.0;9.0` |
| `PYTORCH_CUDA_ALLOC_CONF` | Tune CUDA allocator | `expandable_segments:True` |
| `CUDA_LAUNCH_BLOCKING` | Synchronous kernel launch (debugging) | `1` |
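
The structured `--query-gpu` output combines naturally with `CUDA_VISIBLE_DEVICES`. The sketch below picks the GPU with the most free memory on a shared machine; the function name `freest_gpu` is hypothetical, and it assumes the `index, free` line format that `--format=csv,noheader,nounits` produces.

```shell
# freest_gpu: read "index, free_MiB" lines on stdin and print the index of the
# GPU with the most free memory. Hypothetical helper name.
freest_gpu() {
    sort -t',' -k2,2nr | head -n 1 | cut -d',' -f1 | tr -d ' '
}

# Only query the driver when nvidia-smi is actually installed on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu=$(nvidia-smi --query-gpu=index,memory.free \
              --format=csv,noheader,nounits | freest_gpu)
    echo "Selecting GPU $gpu"
    export CUDA_VISIBLE_DEVICES=$gpu
fi
```

Free memory is only a snapshot, so another user can still claim the same GPU a moment later; on a scheduled cluster, let SLURM assign devices instead.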

SLURM (Cluster Job Scheduling)

SLURM manages GPU cluster resources. You submit jobs as scripts; SLURM schedules them when resources are available:


#!/bin/bash
#SBATCH --job-name=llama-train
#SBATCH --nodes=2 # 2 nodes
#SBATCH --ntasks-per-node=1 # 1 task per node (torchrun handles GPUs)
#SBATCH --gpus-per-node=4 # 4 GPUs per node
#SBATCH --cpus-per-task=32 # 32 CPU cores per task (for DataLoader)
#SBATCH --mem=256G # RAM per node
#SBATCH --time=24:00:00 # Max runtime
#SBATCH --partition=gpu # GPU partition
#SBATCH --output=logs/%j_%x.out # stdout (%j=job_id, %x=job_name)
#SBATCH --error=logs/%j_%x.err # stderr
#SBATCH --signal=B:SIGUSR1@120 # Signal the batch shell 120s before timeout (B: is required for the trap below to fire)

# Setup
module load cuda/12.1
source activate ml
export MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -1)
export MASTER_PORT=29500

# Handle preemption/timeout: save checkpoint on SIGUSR1
trap '[ -n "$PID" ] && echo "Received SIGUSR1, saving checkpoint..." && kill -SIGUSR1 $PID' SIGUSR1

# Launch distributed training
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py \
    --resume_from_checkpoint latest &

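PID=$!
wait $PID
STATUS=$?
# A trapped signal interrupts `wait` (status > 128). Wait again so the script
# does not exit, and end the job, before the checkpoint has been written.
if [ $STATUS -gt 128 ]; then
    wait $PID
    STATUS=$?
fi
exit $STATUS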
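
The batch script relies on two bash behaviours: the parent shell must forward the signal to its child, and a trapped signal makes `wait` return early, so the parent has to wait again or it exits before the checkpoint is written. The sketch below reproduces this locally without SLURM. The `worker` function is a hypothetical stand-in for the training process, and the subshell that sends `SIGUSR1` stands in for the scheduler.

```shell
ckpt_dir=$(mktemp -d)
ckpt="$ckpt_dir/checkpoint.marker"   # created only when the "checkpoint" is saved

# Stand-in for the training process (hypothetical): on SIGUSR1 it writes a
# marker file and exits. A real trainer would handle the signal in train.py.
worker() {
    trap 'echo saved > "$ckpt"; exit 0' USR1
    while true; do sleep 0.2; done
}

worker &                             # background, so the parent is free to run its trap
PID=$!
trap 'kill -USR1 "$PID"' USR1        # parent forwards the signal to the worker

( sleep 1; kill -USR1 $$ ) &         # simulate the scheduler's warning signal

status=0
wait "$PID" || status=$?             # interrupted by the trap: status is > 128
if [ "$status" -gt 128 ]; then
    status=0
    wait "$PID" || status=$?         # wait again until the worker has really exited
fi
echo "worker status: $status, checkpoint saved: $([ -f "$ckpt" ] && echo yes || echo no)"
```

Removing the second `wait` lets the parent finish while the worker is still saving, which under SLURM would end the job mid-checkpoint.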

# ── Job management ──
sbatch train.sh # Submit job
squeue -u $USER # Check your jobs
squeue -u $USER -o "%.10i %.30j %.8T %.10M %.6D %R" # Custom format
scancel JOB_ID # Cancel a job
scancel -u $USER # Cancel all your jobs

# ── Job information ──
sacct -j JOB_ID --format=JobID,Elapsed,MaxRSS,MaxVMSize,State # Job stats
scontrol show job JOB_ID # Detailed job info
seff JOB_ID # Job efficiency (CPU, memory usage)

# ── Cluster information ──
sinfo -p gpu # Check GPU partition availability
sinfo -p gpu -o "%20N %10c %10m %25G %20T" # Node details
squeue -p gpu -h -t PENDING | wc -l # Number of pending jobs (-h drops the header line)

# ── Interactive session ──
srun --partition=gpu --gpus=1 --cpus-per-task=8 --mem=64G \
    --time=4:00:00 --pty bash # Interactive GPU session

| Practice | Why |
|---|---|
| Save checkpoints every N steps | Resume after preemption/timeout without losing work |
| Use `--signal=SIGUSR1@120` | Get 2-minute warning before timeout to save final checkpoint |
| Set `--output=logs/%j.out` | Unique log per job; `%j` = job ID |
| Request exact resources needed | Over-requesting reduces scheduling priority |
| Use `--array=1-10` for sweeps | Submit 10 jobs with different hyperparameters |
| Test with `--time=0:30:00` first | Short test run to catch errors before 24h job |
| Use `--dependency=afterok:JOB_ID` | Chain jobs (e.g., train then evaluate) |

**SLURM job arrays for hyperparameter sweeps.** Use `--array` to submit multiple jobs that differ by one variable:
#SBATCH --array=1-5
# Each array task gets its own SLURM_ARRAY_TASK_ID (1-5 here); use it to pick one value
LEARNING_RATES=(1e-5 3e-5 1e-4 3e-4 1e-3)
LR=${LEARNING_RATES[$SLURM_ARRAY_TASK_ID - 1]}
python train.py --lr "$LR"
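
A single array index can also cover a two-dimensional sweep by splitting it with integer division and modulo. A minimal sketch, assuming a 3 x 2 grid submitted with `--array=1-6`; the value lists, the function name `grid_params`, and the `--batch_size` flag of `train.py` are illustrative assumptions.

```shell
# Map a 1-based array task ID onto a 2-D grid (learning rate x batch size).
# The value lists and the helper name are illustrative assumptions.
LEARNING_RATES=(1e-5 3e-5 1e-4)
BATCH_SIZES=(16 32)

grid_params() {
    local idx=$(( $1 - 1 ))                    # convert to 0-based
    local n_bs=${#BATCH_SIZES[@]}
    local lr=${LEARNING_RATES[idx / n_bs]}     # slow-moving axis
    local bs=${BATCH_SIZES[idx % n_bs]}        # fast-moving axis
    echo "$lr $bs"
}

# Outside SLURM the variable is unset, so default to task 1 for a dry run.
read -r LR BS <<< "$(grid_params "${SLURM_ARRAY_TASK_ID:-1}")"
echo "python train.py --lr $LR --batch_size $BS"
```

The script only echoes the command, so the mapping can be checked on a login node before any job is submitted.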