Field Guide Map
The full path from process ranks to scheduled GPU jobs.
This site is written as a teaching sequence. Start with the MPI model, move into runtime commands, then layer in accelerators and Slurm-based cluster execution.
Distributed Computing Handbook
Learn what MPI is, how the runtime commands work, where GPUs fit, and how Slurm launches jobs across connected machines in a real cluster workflow.
Processes, ranks, communicators, and why MPI dominates multi-node HPC.
What mpicc, mpirun, mpiexec, and srun actually do.
How distributed message passing and local accelerator kernels work together.
How allocations, job scripts, and scheduler-native launches shape production clusters.
Part 1
MPI stands for Message Passing Interface. It is the dominant standard for parallel programs that split work across multiple processes. Those processes may run on one server or across many servers connected by a network.
MPI itself is not a single compiler or a single command. It defines APIs and semantics for sending messages, synchronizing work, and building collective operations such as broadcast, reduce, gather, and scatter.
Threads share memory inside one machine. MPI scales beyond one machine by letting each process own its own memory and exchange data explicitly. This is why MPI remains central in weather models, CFD, molecular dynamics, seismic processing, and large scientific simulations.
MPI_COMM_WORLD is the default communicator containing all ranks.
Collective operations such as Bcast or Reduce involve every rank in a communicator.
Part 2
A typical execution flow is: allocate resources, start multiple ranks, let each rank compute part of the problem, exchange data between ranks, then combine the result.
Every rank calls MPI_Init, joins a communicator, and learns its rank ID.
Each rank processes a local chunk, often based on rank number or domain decomposition.
Ranks exchange halos, partial sums, parameters, or checkpoints using point-to-point and collective calls.
All ranks cleanly terminate with MPI_Finalize after the distributed workflow ends.
Part 3
A practical MPI environment is not just code. It is a stack: scheduler, nodes, ranks, accelerators, and a fabric connecting them.
Node A
Node B
InfiniBand or Ethernet carries MPI messages, collectives, and synchronization traffic.
Part 4
MPI implementations ship compiler wrappers so your build automatically links the right headers and libraries.
mpicc hello.c -o hello
mpicxx solver.cpp -O3 -o solver
mpifort model.f90 -o model
Under the hood, these wrappers usually call GCC, Clang, Intel, NVHPC, or another compiler with MPI include and link flags.
mpirun or mpiexec starts multiple ranks. On Slurm systems, srun is also common and often preferred because it integrates directly with the scheduler.
mpirun -np 8 ./hello
mpiexec -n 16 ./solver
srun -n 32 ./solver
mpirun -np 4 --host node1:2,node2:2 ./app
mpirun -np 8 --map-by ppr:4:node ./app
srun --ntasks=8 --nodes=2 ./app
srun --cpu-bind=cores ./app
Exact flags vary across Open MPI, MPICH, Intel MPI, and site-specific Slurm policy, so always check your cluster documentation too.
Part 5
hello.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                        /* shut down MPI cleanly */
    return 0;
}
compile and run
mpicc hello.c -o hello
mpirun -np 4 ./hello
Each process prints its own rank, and the order of the output lines is not guaranteed. That simple rank identity is the basis of almost every domain decomposition strategy in MPI.
Part 6
MPI handles communication between processes. GPUs handle massively parallel local computation inside each process. In practice, an HPC application often uses MPI between nodes and CUDA, HIP, OpenACC, OpenMP target, or SYCL inside a node.
Typically one MPI rank is bound to one GPU, although one rank may also coordinate several GPUs. The rank launches kernels, stores arrays in device memory, and exchanges boundary data with other ranks.
node A: rank 0 -> GPU 0
node A: rank 1 -> GPU 1
node B: rank 2 -> GPU 0
node B: rank 3 -> GPU 1
Some MPI stacks are GPU-aware, meaning they can send and receive buffers directly from device memory without forcing an explicit copy back to host memory first. This can reduce overhead and simplify code.
Part 7
CUDA_VISIBLE_DEVICES restricts which GPUs a process can see.
Part 8
Slurm is the scheduler and resource manager. It decides where your job runs, reserves CPUs, memory, GPUs, and nodes, then launches tasks under those allocations.
MPI provides process communication. Slurm provides controlled access to shared cluster resources. On most production clusters, you do not choose machines yourself. You ask Slurm for resources, then run your MPI job inside that allocation.
sinfo
squeue
salloc -N 2 -n 8
srun -n 8 ./solver
sbatch run_solver.slurm
scancel 123456
Part 9
#!/bin/bash
#SBATCH --job-name=mpi-cpu-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:30:00
#SBATCH --partition=compute
module load openmpi
srun ./solver
#!/bin/bash
#SBATCH --job-name=mpi-gpu-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00
#SBATCH --partition=gpu
module load cuda
module load openmpi
srun --gpu-bind=single:1 ./gpu_solver
--nodes requests how many machines you need. --ntasks-per-node controls MPI rank count per machine. --gpus-per-node reserves accelerators. srun then launches the executable using the exact allocation that Slurm granted.
Part 10
MPI is a standard, but several implementations package it differently. The API looks similar, but launch behavior, tuning options, transport layers, and cluster integration differ in practice.
Part 11
Decide whether you want one rank per core, one rank per NUMA region, or one rank per GPU.
At scale, communication and synchronization can dominate runtime even if kernels are fast.
On managed clusters, srun usually behaves better than bypassing Slurm with ad hoc remote launch methods.
Not every MPI build supports direct device-buffer transfers. Verify your site’s MPI build flags and network stack.
Run on one node first, then multiple nodes, then mixed CPU and GPU allocations once correctness is stable.
Cluster environments often provide specific versions of CUDA, Open MPI, NCCL, compilers, and math libraries.
Part 12
This often points to a launch mismatch: wrong task count, firewall or host reachability issues, or the job starting outside the Slurm allocation model expected by the cluster.
The bottleneck may be host-device copies, too many ranks per GPU, or heavy communication between kernels. Measure data movement, not just kernel time.
Look for race-like logic bugs in halo exchange order, missing collectives, non-deterministic reductions, or domain decomposition mistakes that only appear when more ranks participate.
Part 13
MPI: distributed-process communication across one or many machines.
Commands: mpicc, mpirun, mpiexec, and often srun on Slurm clusters.
GPUs: used for local acceleration while MPI coordinates the distributed pieces.
Slurm: the scheduler that allocates nodes, CPUs, memory, and GPUs before launch.