
Likwid Mpirun


likwid-mpirun: enables simple pinning for MPI and hybrid MPI/threaded applications

Introduction

Pinning to dedicated compute resources is important for pure MPI applications and even more so for hybrid MPI/threaded applications. While all major MPI implementations include their own pinning mechanisms, likwid-mpirun provides a simple and portable solution based on the powerful capabilities of likwid-pin. It is still experimental at the moment, but it can be adapted to any MPI and OpenMP combination with the help of a tuning application in the test directory of LIKWID. likwid-mpirun works in conjunction with PBS, LoadLeveler and SLURM. The tested compilers and MPI implementations are the Intel C/C++ compiler, GCC, Intel MPI and OpenMPI. Support for MVAPICH is untested.

Usage

As usual you can get a help message with

$ likwid-mpirun -h

You always have to specify the total number of MPI processes with the -np NUMPROC option. Two cases are distinguished: pure MPI and hybrid applications.

Pure MPI:

$ likwid-mpirun -np 16 ./a.out

This will start 16 processes; the number of processes per compute node is calculated from the PBS/LoadLeveler/SLURM node file. If two hosts are given, eight processes per node are pinned to cores/SMT threads. The pinning is implemented with the likwid-pin node domain.

Pure MPI with explicit pinning:

$ likwid-mpirun -np 16 -nperdomain S:2 ./a.out

For this case the single option -nperdomain covers all variants. Its argument consists of a domain character, as already known from the other LIKWID applications, and the number of processes per domain, separated by a colon. The above example will start two processes per socket until 16 processes are reached and will pin the processes with likwid-pin.

Domains can be:

  • N - for node
  • S - for socket
  • C - for last level shared cache
  • M - for NUMA domain (interesting e.g. for AMD Magny Cours)

For pinning on Magny Cours the following can be useful:

$ likwid-mpirun -np 16 -nperdomain M:2 ./a.out

This will start 2 processes per NUMA domain. On a two-socket AMD Magny Cours system this will result in 8 processes per node, with two nodes total for this run.
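
The other domain types work the same way. As an illustrative sketch (the binary ./a.out is a placeholder), one process per last level shared cache group could be requested with:

$ likwid-mpirun -np 16 -nperdomain C:1 ./a.out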

For debugging use the debug option:

$ likwid-mpirun -debug -np 16 -nperdomain M:2 ./a.out

This will output all commands that would be executed.

Pinning of hybrid applications:

$ likwid-mpirun  -np 16 -pin S0:0,1_S1:0,1 ./a.out

Hybrid pinning has only one option, -pin, covering all possibilities. Its argument consists of valid likwid-pin expressions separated by underscores. The number of separated expressions denotes the number of processes started per node. The above example will start two processes per node. The first process and its threads (two) will be pinned to the first socket (S0), cores 0 and 1. The second process and its threads will be pinned to the second socket (S1), cores 0 and 1. Consequently, with 16 processes and two processes per node, the above statement requires eight hosts to run.
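
The likwid-pin expressions may also use ranges. As a hedged sketch (assuming each process runs four OpenMP threads, e.g. via OMP_NUM_THREADS=4), two processes per node with four threads each could be pinned as follows:

$ likwid-mpirun -np 8 -pin S0:0-3_S1:0-3 ./a.out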

The main pinning complexity is that the OpenMP as well as the MPI implementation may start additional threads for management purposes. These threads need to be skipped, and their position among the started threads has to be determined in advance. For the tested MPI and compiler combinations, the skip masks are already integrated into likwid-mpirun.
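
For untested combinations, a skip mask can be supplied manually with the -s/--skip option. The following is only an illustrative sketch; the actual mask value depends on the MPI and OpenMP implementation in use (a set bit marks a spawned thread that is skipped during pinning):

$ likwid-mpirun -np 8 -pin S0:0-3_S1:0-3 -s 0x1 ./a.out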

At the moment all pinning uses block distribution; round robin variants at node and global level are planned.

Options

-h, --help		 Help message
-v, --version		 Version information
-d, --debug		 Debugging output
-n/-np <count>		 Set the total number of MPI processes
-nperdomain <domain>	 Set the number of processes per node by giving an affinity domain and count (e.g. S:2)
-pin <list>		 Specify pinning of threads: likwid-pin CPU expressions separated by '_'
-s, --skip <hex>	 Bitmask with threads to skip during pinning
-mpi <id>		 Specify which MPI should be used. Possible values: openmpi, intelmpi and mvapich2
			 If not set, the module system is checked
-omp <id>		 Specify which OpenMP should be used. Possible values: gnu and intel
			 Only required for statically linked executables
-hostfile <file>	 Use a custom hostfile instead of searching the environment
-g/-group <perf>	 Set a likwid-perfctr compatible performance group or event set for measuring on the nodes
-m/-marker		 Activate Marker API mode
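
If the MPI or OpenMP implementation cannot be detected automatically, for example for statically linked executables or outside a batch environment, they can be stated explicitly. A hedged sketch (the hostfile path ./hosts and the binary ./a.out are placeholders):

$ likwid-mpirun -np 8 -nperdomain S:4 -mpi intelmpi -omp intel -hostfile ./hosts ./a.out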

Performance measurements of MPI and hybrid applications

Besides the correct pinning of MPI processes and their threads, the application execution can be measured using likwid-perfctr. By setting a performance group or custom event set on the command line, the call of likwid-pin is substituted with likwid-perfctr. Currently, you can perform end-to-end measurements of the whole run or measure instrumented code regions using the LIKWID Marker API.
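
Besides the predefined performance groups, a custom event set in likwid-perfctr syntax can be given. The following is only a sketch; the event and counter names are assumptions and depend on the architecture (here an Intel fixed counter):

$ likwid-mpirun -np 8 -nperdomain S:2 -g INSTR_RETIRED_ANY:FIXC0 ./a.out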

Measure the energy used by all participating systems running one process per socket:

$ likwid-mpirun -nperdomain S:1 -g ENERGY ./a.out

likwid-mpirun is intelligent enough to measure socket-wide performance counters on only one CPU per socket; the other processes skip reading those hardware registers and read only the core-local performance counters.
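
Instrumented applications can be measured the same way by additionally activating the Marker API mode. A hedged sketch, assuming ./a.out contains LIKWID marker regions:

$ likwid-mpirun -np 4 -nperdomain S:1 -g ENERGY -m ./a.out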

When measuring is activated, overloading of the hosts is not allowed. Multiple processes would read the same hardware performance counters, so the final results would no longer be valid. There are plans to substitute likwid-perfctr with likwid-pin for the overloaded processes.
