LIKWID and Monitoring

Thomas Gruber edited this page Aug 25, 2023 · 6 revisions

LIKWID and Monitoring systems

In cluster environments, a monitoring system is often in place to track activity on the system. Some are even a combination of a resource management system (like SLURM, PBS, ...) and a monitoring solution. SLURM is one of those, but there are others.

SLURM

If SLURM tracks activity with hardware counters, it will probably skew the LIKWID measurements. It is unclear whether this tracking can be disabled on a per-job basis.

HTCondor

HTCondor is a high-throughput computing software framework with resource management and monitoring. There is currently no way to disable HTCondor's usage of hardware counters, so at the moment, don't use LIKWID on systems managed by HTCondor.

Performance Co-Pilot

Performance Co-Pilot (PCP for short) is a system performance analysis toolkit that measures hardware performance counters in the background. To run your own measurements, you have to disable PCP for your environment.

Disable PCP's hardware performance counting for the whole shell:

perfalloc -d

Disable PCP's hardware performance counting just for the command:

perfalloc <command>

Using LIKWID in your monitoring tool

There are many node agents available, and some of them have LIKWID support, so they read hardware performance counters through LIKWID rather than perf_event or PAPI. Lately, we received reports that the values gathered with LIKWID are sometimes wrong, off, or physically impossible. We did some investigation and still don't know the root cause, but the wrong values result from high system call times when the system is under load. This might be caused by security mitigations for hardware flaws like Meltdown.

The startCounters() function is not critical, as all counter accesses (and therefore system calls) are performed before the timer is started, but the stopCounters() function is problematic. The first operation of stopCounters() is to stop the timer; only then are the counters read out (system calls). If each system call takes longer, the last system call may be issued X seconds after the timer was stopped. During this time, the counters keep incrementing because there is load on the system (memory accesses, FP operations, ...). The counter values can therefore be much higher than expected, and time-based derived metrics like bandwidths come out wrong.
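The size of the effect can be estimated with a simple model. The sketch below assumes a constant event rate and a single fixed delay between stopping the timer and the last counter read. The function names and example numbers are illustrative only and are not part of LIKWID.

```python
def derived_rate(true_rate, interval, read_delay):
    """Rate a monitoring agent would report under this simple model.

    true_rate  -- events per second actually happening on the system
    interval   -- seconds covered by the timer (start to stop)
    read_delay -- seconds between stopping the timer and the last counter read
    """
    # The counter keeps running during the delay, so it sees extra events ...
    counted_events = true_rate * (interval + read_delay)
    # ... but the derived metric divides by the timer interval only.
    return counted_events / interval


def relative_error(interval, read_delay):
    """Overestimation relative to the true rate (0.25 means 25 % too high)."""
    return read_delay / interval


if __name__ == "__main__":
    # Hypothetical numbers: 100 events/s, 2 s interval, 0.5 s readout delay.
    print(derived_rate(100.0, 2.0, 0.5))
    print(relative_error(2.0, 0.5))
```

In this model the reported rate is too high by the factor read_delay / interval, which is why short measurement intervals on a loaded system are affected most.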

There is a possible fix: the monitoring agent (with LIKWID) needs a different scheduling policy than the other applications. We got reports that the round-robin policy solves the problem:

# chrt --rr <prio> ./my-monitoring-daemon

According to tests, the priority does not matter; any value between 1 and 99 works.
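A monitoring agent written in Python could also request the policy itself at startup instead of being launched through chrt. This is a minimal sketch, assuming Linux and sufficient privileges (root or CAP_SYS_NICE). The function name request_round_robin is hypothetical, and the default priority of 10 is an arbitrary choice.

```python
import os


def request_round_robin(priority=10):
    """Try to switch the caller to the SCHED_RR (round-robin) policy.

    Returns True on success and False if the platform does not offer
    SCHED_RR or the caller is not allowed to change its policy.
    """
    if not hasattr(os, "SCHED_RR") or not hasattr(os, "sched_setscheduler"):
        return False  # scheduling API not available on this platform
    # Clamp to the range the kernel accepts for SCHED_RR (1..99 on Linux).
    lowest = os.sched_get_priority_min(os.SCHED_RR)
    highest = os.sched_get_priority_max(os.SCHED_RR)
    priority = max(lowest, min(highest, priority))
    try:
        # pid 0 addresses the caller itself.
        os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(priority))
    except OSError:
        # Includes PermissionError when privileges are missing.
        return False
    return True


if __name__ == "__main__":
    if not request_round_robin():
        print("could not switch to SCHED_RR; counter readout may be delayed")
```

Returning False instead of raising lets the agent keep running with the default policy and log a warning, which is a design choice of this sketch rather than a requirement.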

Another solution would be to increase the measurement time to reduce the effect of the slow system calls.
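How much longer the measurement has to be depends on the readout delay and on the error one is willing to accept. As a rough sketch, assume the counters are read a fixed delay after the timer stops and the load is roughly constant, so the reading is too high by about delay / interval. The helper name min_interval is hypothetical, and both inputs would have to be determined on the real system.

```python
def min_interval(read_delay, tolerance):
    """Shortest measurement interval in seconds that keeps the relative
    overestimation (read_delay / interval) at or below tolerance."""
    if read_delay < 0 or tolerance <= 0:
        raise ValueError("read_delay must be >= 0 and tolerance > 0")
    return read_delay / tolerance


if __name__ == "__main__":
    # Hypothetical numbers: 0.5 s readout delay, 25 % accepted error.
    print(min_interval(0.5, 0.25))
```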

Just changing the niceness is usually not enough. We got reports of this issue from different centers using official RHEL as well as RockyLinux on their HPC systems. Interestingly, NHR@FAU uses AlmaLinux, another RHEL-derived distribution, and does not see these issues ... yet.
