Distributed Computing Handbook

MPI from first principles to GPU clusters under Slurm.

Learn what MPI is, how the runtime commands work, where GPUs fit, and how Slurm launches jobs across connected machines in a real cluster workflow.

Field Guide Map

The full path from process ranks to scheduled GPU jobs.

This site is written as a teaching sequence. Start with the MPI model, move into runtime commands, then layer in accelerators and Slurm-based cluster execution.

01

Mental model

Processes, ranks, communicators, and why MPI dominates multi-node HPC.

02

Runtime tools

What mpicc, mpirun, mpiexec, and srun actually do.

03

GPU integration

How distributed message passing and local accelerator kernels work together.

04

Slurm workflows

How allocations, job scripts, and scheduler-native launches shape production clusters.

Part 1

What MPI actually is

MPI is a communication standard

MPI stands for Message Passing Interface. It is the dominant standard for parallel programs that split work across multiple processes. Those processes may run on one server or across many servers connected by a network.

MPI itself is not a single compiler or a single command. It defines APIs and semantics for sending messages, synchronizing work, and building collective operations such as broadcast, reduce, gather, and scatter.

Why HPC systems rely on it

Threads share memory inside one machine. MPI scales beyond one machine by letting each process own its own memory and exchange data explicitly. This is why MPI remains central in weather models, CFD, molecular dynamics, seismic processing, and large scientific simulations.

Core vocabulary

Rank: The integer ID of one MPI process within a communicator, numbered from 0 to size - 1.
World: MPI_COMM_WORLD, the default communicator containing all ranks.
Node: A physical or virtual machine participating in the job.
Collective: An operation that every rank in a communicator takes part in, such as MPI_Bcast or MPI_Reduce.

Part 2

How an MPI program runs

A typical execution flow is: allocate resources, start multiple ranks, let each rank compute part of the problem, exchange data between ranks, then combine the result.

01

Initialize

Every process calls MPI_Init, becomes a member of MPI_COMM_WORLD, and learns its rank ID with MPI_Comm_rank.

02

Partition work

Each rank processes a local chunk, often based on rank number or domain decomposition.

03

Communicate

Ranks exchange halos, partial sums, parameters, or checkpoints using point-to-point and collective calls.

04

Finalize

Every rank calls MPI_Finalize once the distributed work is done, which shuts down the MPI runtime cleanly.

Part 3

How the cluster fits together

A practical MPI environment is not just code. It is a stack: scheduler, nodes, ranks, accelerators, and a fabric connecting them.

Slurm Control
Allocates nodes, tasks, CPUs, memory, and GPUs.

Node A

Rank 0: CPU work, GPU 0
Rank 1: CPU work, GPU 1

Node B

Rank 2: CPU work, GPU 0
Rank 3: CPU work, GPU 1

Interconnect fabric

InfiniBand or Ethernet carries MPI messages, collectives, and synchronization traffic.

Part 4

Command-line tools you actually use

Build wrappers

MPI implementations ship compiler wrappers so your build automatically links the right headers and libraries.

mpicc hello.c -o hello
mpicxx solver.cpp -O3 -o solver
mpifort model.f90 -o model

Under the hood, these wrappers usually call GCC, Clang, Intel, NVHPC, or another compiler with MPI include and link flags.

Launch commands

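mpirun or mpiexec starts multiple ranks. mpiexec is the launcher name suggested by the MPI standard, and most implementations provide mpirun as an equivalent. On Slurm systems, srun is also common and often preferred because it integrates directly with the scheduler.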

mpirun -np 8 ./hello
mpiexec -n 16 ./solver
srun -n 32 ./solver

Useful runtime flags

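mpirun -np 4 --host node1:2,node2:2 ./app
mpirun -np 8 --map-by ppr:4:node ./app
srun --ntasks=8 --nodes=2 ./app
srun --cpu-bind=cores ./app

The mpirun lines use Open MPI syntax: --host names the machines, with :2 giving each host two slots (recent Open MPI versions count a host listed without a slot count as a single slot), and --map-by ppr:4:node places four ranks on every node. Exact flags vary across Open MPI, MPICH, Intel MPI, and site-specific Slurm policy, so always check your cluster documentation too.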

Part 5

A minimal MPI example

hello.c

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

compile and run

mpicc hello.c -o hello
mpirun -np 4 ./hello

Each process prints its own rank. That simple rank identity is the basis of almost every domain decomposition strategy in MPI.

Part 6

Can MPI use GPUs?

Yes, but MPI is not the GPU kernel layer

MPI handles communication between processes. GPUs handle massively parallel local computation inside each process. In practice, an HPC application often uses MPI between processes, whether they share a node or sit on different nodes, and CUDA, HIP, OpenACC, OpenMP target, or SYCL inside each process to drive the GPU.

Common hybrid pattern

Most often one MPI rank is bound to one GPU, although one rank may also coordinate several GPUs. The rank launches kernels, keeps its arrays in device memory, and exchanges boundary data with other ranks.

node A: rank 0 -> GPU 0
node A: rank 1 -> GPU 1
node B: rank 2 -> GPU 0
node B: rank 3 -> GPU 1

GPU-aware MPI

Some MPI stacks are GPU-aware, meaning they can send and receive buffers directly from device memory without forcing an explicit copy back to host memory first. This can reduce overhead and simplify code.

Part 7

Practical GPU workflow with MPI

| Stage | What happens |
| --- | --- |
| Select device | Each rank chooses a GPU, often based on local rank or environment variables like CUDA_VISIBLE_DEVICES. |
| Allocate device buffers | Simulation state or tensor data is placed on the GPU for kernel execution. |
| Compute locally | The rank launches GPU kernels for the local partition of the domain. |
| Exchange boundaries | MPI sends halo regions or partial results to neighboring ranks, ideally using GPU-aware transfers if available. |
| Reduce results | MPI collectives combine distributed outputs, for example total error, max residual, or global timestep constraints. |

Part 8

Where Slurm fits in a cluster

Slurm is the scheduler and resource manager. It decides where your job runs, reserves CPUs, memory, GPUs, and nodes, then launches tasks under those allocations.

Key idea

MPI provides process communication. Slurm provides controlled access to shared cluster resources. In many production clusters, you do not pick machines yourself. You ask Slurm for resources, then run your MPI job inside that allocation.

Typical Slurm commands

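sinfo                      # show partitions and node states
squeue                     # list queued and running jobs
salloc -N 2 -n 8           # request an interactive allocation: 2 nodes, 8 tasks
srun -n 8 ./solver         # launch 8 tasks inside the allocation
sbatch run_solver.slurm    # submit a batch script
scancel 123456             # cancel the job with this ID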

Part 9

Example Slurm scripts

CPU-only MPI job

#!/bin/bash
#SBATCH --job-name=mpi-cpu-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:30:00
#SBATCH --partition=compute

module load openmpi
srun ./solver

MPI + GPU job

#!/bin/bash
#SBATCH --job-name=mpi-gpu-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00
#SBATCH --partition=gpu

module load cuda
module load openmpi

srun --gpu-bind=single:1 ./gpu_solver

Reading the directives

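--nodes requests how many machines you need. --ntasks-per-node sets the MPI rank count per machine, so the total rank count is nodes times tasks per node. --gpus-per-node reserves accelerators on each machine, and --cpus-per-task reserves CPU cores for each rank. srun then launches one task per rank inside the allocation that Slurm granted, and --gpu-bind=single:1 binds each task to a single GPU with one task per GPU.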

Part 10

Which MPI implementation are you using?

MPI is a standard, but several implementations package it differently. The API looks similar, but launch behavior, tuning options, transport layers, and cluster integration differ in practice.

| Implementation | Typical use and notes |
| --- | --- |
| Open MPI | Common in research clusters, flexible transports, many runtime flags, and broad support across Linux HPC environments. |
| MPICH | Reference-family implementation with a clean design; often used as the base for vendor-tuned derivatives. |
| Intel MPI | Optimized for Intel-heavy environments and often integrated with oneAPI toolchains and Intel-centric cluster stacks. |
| MVAPICH | Popular on high-performance interconnects, especially where tuned InfiniBand performance matters. |

Part 11

Practical advice before you scale up

Match rank layout to hardware

Decide whether you want one rank per core, one rank per NUMA region, or one rank per GPU.

Measure communication cost

At scale, communication and synchronization can dominate runtime even if kernels are fast.

Prefer scheduler-native launches

On managed clusters, srun usually behaves better than bypassing Slurm with ad hoc remote launch methods such as SSH-based startup, because the scheduler can place, track, and clean up every task.

Check GPU awareness

Not every MPI build supports direct device-buffer transfers. Verify your site’s MPI build flags and network stack.

Start small

Run on one node first, then multiple nodes, then mixed CPU and GPU allocations once correctness is stable.

Use the cluster modules

Cluster environments often provide specific versions of CUDA, Open MPI, NCCL, compilers, and math libraries.

Part 12

Common failures and what they usually mean

Job hangs at startup

This often points to a launch mismatch: wrong task count, firewall or host reachability issues, or the job starting outside the Slurm allocation model expected by the cluster.

GPU code is slower than expected

The bottleneck may be host-device copies, too many ranks per GPU, or heavy communication between kernels. Measure data movement, not just kernel time.

Wrong answers only at scale

Look for race-like logic bugs in halo exchange order, missing collectives, non-deterministic reductions, or domain decomposition mistakes that only appear when more ranks participate.

Part 13

Quick summary

MPI

Distributed-process communication across one or many machines.

Commands

mpicc, mpirun, mpiexec, and often srun on Slurm clusters.

GPUs

Used for local acceleration while MPI coordinates the distributed pieces.

Slurm

The scheduler that allocates nodes, CPUs, memory, and GPUs before launch.