
MemoryMeasurementsBDX

Thomas Roehl edited this page Jun 10, 2020 · 1 revision

Memory Measurements on Intel Broadwell EP/EX

This blog post describes why memory measurements on Intel Broadwell EP/EX systems are sometimes far too high and how to work around it.

Problem

If you measure the memory traffic on Intel Broadwell EP/EX with the MEM* groups, some systems return implausibly high numbers. I have received reports of this from several computing centers.

+------------------------------------------+---------+------------+-----------------+
|                   Event                  | Counter |   Core 0   |     Core 10     |
+------------------------------------------+---------+------------+-----------------+
|             INSTR_RETIRED_ANY            |  FIXC0  | 2093481000 |      1528120000 |
|           CPU_CLK_UNHALTED_CORE          |  FIXC1  | 5590396000 |      5125980000 |
|           CPU_CLK_UNHALTED_REF           |  FIXC2  | 3975666000 |      3638550000 |
|              PWR_PKG_ENERGY              |   PWR0  |    53.7656 |         55.2363 |
|              PWR_DRAM_ENERGY             |   PWR3  |    11.8070 |         12.5511 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |          0 |               0 |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |     379993 |          379990 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |  902215000 |       902215000 |
|               CAS_COUNT_RD               | MBOX0C0 |   56361240 |        56586960 |
|               CAS_COUNT_WR               | MBOX0C1 |   28293350 |        28353280 |
|               CAS_COUNT_RD               | MBOX1C0 |   56474850 |        56681730 |
|               CAS_COUNT_WR               | MBOX1C1 |   28406420 |        28456820 |
|               CAS_COUNT_RD               | MBOX2C0 |   56366030 |        56582170 |
|               CAS_COUNT_WR               | MBOX2C1 |   28295140 |        28349200 |
|               CAS_COUNT_RD               | MBOX3C0 |   56362180 |        56567480 |
|               CAS_COUNT_WR               | MBOX3C1 |   28293360 |        28344380 |
|               CAS_COUNT_RD               | MBOX4C0 |   56595560 | 141008100000000 |
|               CAS_COUNT_WR               | MBOX4C1 |   28384190 | 141008100000000 |
|               CAS_COUNT_RD               | MBOX5C0 |   56596110 | 141008100000000 |
|               CAS_COUNT_WR               | MBOX5C1 |   28384440 | 141008100000000 |
|               CAS_COUNT_RD               | MBOX6C0 |          0 |               0 |
|               CAS_COUNT_WR               | MBOX6C1 |          0 |               0 |
|               CAS_COUNT_RD               | MBOX7C0 |          0 |               0 |
|               CAS_COUNT_WR               | MBOX7C1 |          0 |               0 |
+------------------------------------------+---------+------------+-----------------+

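MBOX4 and MBOX5 should not be active on this system, yet in the Core 10 column all four of their counters report 141008100000000.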
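Outliers like these are easy to miss in a long table. The following is a minimal sketch of a filter that reads a likwid-perfctr result table like the one above on standard input and prints every MBOX counter with a value above a limit. The function name flag_suspicious and the default limit of 1000000000000 are assumptions made for illustration; they are not part of LIKWID, and a sensible limit depends on the runtime of the measurement.

```shell
# Hypothetical helper, not part of LIKWID: print the name of every MBOX
# counter whose value exceeds the limit in at least one column.
# Reads a likwid-perfctr ASCII result table on stdin.
# $1 is the limit (assumed default: 1000000000000).
flag_suspicious() {
    awk -F'|' -v limit="${1:-1000000000000}" '
        # Field 3 holds the counter name, fields 4..NF-1 hold the values.
        $3 ~ /MBOX[0-9]+C[0-9]+/ {
            counter = $3
            gsub(/ /, "", counter)
            for (i = 4; i < NF; i++) {
                if ($i + 0 > limit + 0) {
                    print counter
                    next
                }
            }
        }'
}
```

Saving the table to a file and running flag_suspicious < table.txt lists the counters that are candidates for removal from the group.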

What is the problem

In most cases, the cause is memory controller counter registers that are only partly deactivated. Intel Broadwell EP systems usually have 4 active memory channels. LIKWID does not know how many channels (PCI devices) are active, so it tests all of them and marks a device as available only if every test passes. Besides checking the availability of the PCI devices, it also tries to read and write the counter registers. On the affected systems, these checks pass for 6 channels. Often the additional memory channel devices simply return zero, which does not change the results; in the case shown above they return huge values that inflate the totals.

How to fix it

Beyond these accessibility checks there is not much LIKWID can do, so the only fix is to create a custom copy of the MEM* groups for the affected systems and remove the faulty memory channels:

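mkdir -p ~/.likwid/groups/broadwellEP
cp <LIKWID_BASE>/share/likwid/perfgroups/broadwellEP/MEM.txt ~/.likwid/groups/broadwellEP/MEM_BDX.txt
# Edit the copy with any text editor:
#   - remove the unneeded counter registers (the faulty MBOX entries)
#   - update the metric formulas so they no longer use those counters
${EDITOR:-vi} ~/.likwid/groups/broadwellEP/MEM_BDX.txt
likwid-perfctr -g MEM_BDX ...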
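
The manual edit can also be scripted. The sketch below rests on assumptions about the group file layout: event set lines start with the counter name (for example MBOX4C0 followed by the event), and the metric formulas chain the counters with '+' with a lower-numbered MBOX term in front of the ones being removed. The function name strip_mbox_4_to_7 is hypothetical, and the range 4-7 should be adjusted to the channels that are faulty on your system. Check the resulting file by hand before using it.

```shell
# Hypothetical helper, not part of LIKWID: remove MBOX4..MBOX7 from a
# performance group file read on stdin and write the result to stdout.
# Assumption: event set lines begin with the counter name, and formulas
# reference the counters as "+MBOXnCm" terms inside a sum.
strip_mbox_4_to_7() {
    sed -E \
        -e '/^MBOX[4-7]C[0-9][[:space:]]/d' \
        -e 's/\+MBOX[4-7]C[0-9]//g'
}
```

With the original group file in a shell variable SRC (also an assumption), strip_mbox_4_to_7 < "$SRC" > ~/.likwid/groups/broadwellEP/MEM_BDX.txt would replace the copy and edit steps.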