AMD family 25, model 68 support missing (e.g. AMD Ryzen 7 PRO 6850U) #635

Open
LadnerJonas opened this issue Sep 27, 2024 · 11 comments

@LadnerJonas

Why do you need support for this specific architecture?
This CPU is a Zen3+ model, which is used in many recent ThinkPads.

Which architecture model, family and further information? CPU

> cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 68
model name      : AMD Ryzen 7 PRO 6850U with Radeon Graphics
stepping        : 1
microcode       : 0xa404105

> likwid-perfctr --version
likwid-perfctr -- Version 5.3.0

> uname -a
Linux 6.6.47-1-Manjaro Linux

Is the documentation of the hardware counters publicly available?
Linux perf is working fine, so I assume yes.
Are there already any usable tools (commercial or open-source)?

> perf stat /bin/true           

 Performance counter stats for '/bin/true':

              0,56 msec task-clock                       #    0,333 CPUs utilized             
                 1      context-switches                 #    1,796 K/sec                     
                 0      cpu-migrations                   #    0,000 /sec                      
                52      page-faults                      #   93,367 K/sec                     
         1.652.810      cycles                           #    2,968 GHz                       
           569.581      stalled-cycles-frontend          #   34,46% frontend cycles idle      
           935.173      instructions                     #    0,57  insn per cycle            
                                                  #    0,61  stalled cycles per insn   
           212.933      branches                         #  382,325 M/sec                     
            26.256      branch-misses                    #   12,33% of all branches           

       0,001672605 seconds time elapsed

       0,001560000 seconds user
       0,000000000 seconds sys

If you need any more information, I am happy to provide it.
Thank you!

@TomTheBear
Member

Unfortunately, I cannot find matching documentation at AMD TechDocs for family 19h model 44h. Getting the information out of the kernel is possible but time-consuming.
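
For readers matching the /proc/cpuinfo output above against AMD's naming: the kernel prints family and model in decimal, while AMD documents use hexadecimal with an "h" suffix. A minimal sketch of the conversion follows; the helper name amd_hex_name is hypothetical and not part of LIKWID or any AMD tool.

```python
def amd_hex_name(family: int, model: int) -> str:
    """Format decimal cpuinfo family/model values in AMD's hexadecimal notation."""
    return f"family {family:X}h model {model:X}h"


# Values taken from the /proc/cpuinfo output in this issue.
print(amd_hex_name(25, 68))
```

With the values from this issue (family 25, model 68) this gives family 19h model 44h, the name used in the comment above.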

@tnibler

tnibler commented Sep 30, 2024

It's a huge pain: PPR docs for the last 3 (actually 4 now) generations are missing, and AMD support just says they will be released "later".

In the meantime I do have some time to give it a shot myself (model 75h personally). There is the BKDG, but lots of events are missing from it, so I guess the files in perf are the reference? Is there some method, or do you just check all event numbers and bitmasks one by one? And how likely is it that big changes are needed? Just changing the model number in perfmon_zen4_counters.h does work, but I'd like to make sure the counters are not mixed up and producing wrong results.

Thank you :)

@tnibler

tnibler commented Oct 10, 2024

Update: it seems there are at least some differences between model 75h and the Zen4 counter mappings in LIKWID (I don't know about 68h), and a good number of events in perf, e.g. cache events, don't work or give weird values compared to uprof. AMD support still won't send over manuals, so yeah, I guess that's that for the time being.

@TomTheBear
Member

You mean there are some wrong configs in LIKWID regarding model 75h? Could you point out a few so I have a starting point?

As far as I know the perf_event code for AMD chips in the Linux kernel, it basically only differentiates between AMD K17 and K19. The cache events are defined directly in perf_event only for K17; K19 uses the same list. I never compared the two K's that deeply, so I am not sure whether this is correct or a mistake.

@tnibler

tnibler commented Oct 10, 2024

That was a misleading way to phrase it; I'm not sure LIKWID is actually doing anything wrong. perf has the same L3 lookup state events as LIKWID, but they are not shown in perf list and don't work if referred to by name with perf stat.

likwid-perfctr debug-prints "Cannot access counter register CPMC0" and only shows 0 counts with -g L3CACHE, but perf also cannot read anything when given, for instance, -e r04ff for the L3 lookup state. The rest is just baseless speculation; I don't really know how to debug much further than that.

@TomTheBear
Member

The important part in the linked JSON file is: https://github.com/torvalds/linux/blob/9852d85ec9d492ebef56dc5f229416c925758edc/tools/perf/pmu-events/arch/x86/amdzen4/cache.json#L658

The L3PMC is similar to LIKWID's CPMC (Cache Performance Monitoring Counter). It is a different unit that you have to specify explicitly. In LIKWID it is encoded in the counter name CPMC0; in perf, you have to specify it as -e amd_l3/config=0x04ff/. But if your LIKWID installation was built with ACCESSMODE=perf_event, the reason why neither LIKWID nor perf works is that the perfmon unit is not exposed by your system through perf_event (the folder /sys/devices/amd_l3 does not exist). LIKWID with ACCESSMODE=accessdaemon is capable of using these units even if they are not exposed by the kernel. But since you run on some laptop-dedicated chip, the unit might really not exist at hardware level.
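
The sysfs check described above can be scripted. The following is a minimal sketch that only tests whether the directory exists; the function name pmu_exposed and its sysfs_root parameter are hypothetical, and a missing directory says nothing about whether the unit exists in hardware.

```python
from pathlib import Path


def pmu_exposed(name: str, sysfs_root: str = "/sys/devices") -> bool:
    """Return True if a directory for this PMU name exists under sysfs_root."""
    return (Path(sysfs_root) / name).is_dir()


if __name__ == "__main__":
    # amd_l3 is the unit named in this comment.
    if pmu_exposed("amd_l3"):
        print("amd_l3 is exposed through perf_event")
    else:
        print("amd_l3 is not exposed; perf and ACCESSMODE=perf_event cannot use it")
```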

@tnibler

tnibler commented Oct 10, 2024

/sys/devices/amd_l3 does indeed not exist, but with the msr module and Linux hardening stuff disabled everything seems to work, thank you very much!

However, when using the marker API with -m I get "Cannot access counter register CPMC0" again :/ I'll look into that, but it's still likwid-accessD doing the MSR access, right? (I've setcapped every binary involved, so I don't think it can be that.)

@TomTheBear
Member

Make sure you rebuild LIKWID completely (make distclean && make) after changes to config.mk, and ensure at runtime that your application finds the right LIKWID library. I have often seen these issues with multiple LIKWID installations where the linker picks the wrong one at runtime.

If I understand the capabilities system correctly, you have to use setcap on your application. The access daemon inherits the capabilities of your application. I have not played around with capabilities much, but enough to tell most users not to use them, since you have to give capabilities to the Lua interpreter (likwid-lua). Every Lua script executed by this interpreter then gets the MSR access capabilities.
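
One way to see which LIKWID library an application resolves at runtime is to filter the output of ldd. This is a rough sketch, assuming ldd is available and the application links dynamically against a library whose file name contains liblikwid; the function names are hypothetical.

```python
import subprocess


def likwid_lines(ldd_output: str) -> list[str]:
    """Keep only the ldd output lines that mention liblikwid."""
    return [line.strip() for line in ldd_output.splitlines() if "liblikwid" in line]


def resolved_likwid(binary: str) -> list[str]:
    """Run ldd on the given binary and return its liblikwid entries."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=False
    )
    return likwid_lines(result.stdout)
```

Calling resolved_likwid with the path to your application lists the liblikwid paths the dynamic linker would pick. An unexpected path, or more than one entry, hints at a second installation.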

@tnibler

tnibler commented Oct 10, 2024

Hmm, every binary (and .so?) appearing in strace -f ... | grep exec has been setcapped (too many for comfort) and it still does it. PMC events work, CPMC without marker works so there must be something. But whatever it's fine, it's not super necessary and not worth the security implications as you said. The important stuff does work fine.

What's your policy for adding in more supported models in topology.h then? They're all checked one by one in if statements, so it might get a bit janky to add 20 models per vendor per year.

@TomTheBear
Member

As I said, I do not have much experience with capabilities. All my assumptions might be wrong. If PMC works, it sounds like a different issue. Can you please provide the output of a run with -V 3 (as a file) with the CPMC counters?

The whole topology lookup code was already there when I took over the project. It has grown quite fat in the meantime, correct. For some architectures, we create a macro like ARCHGROUP(arch) (((arch) == X) || ((arch) == Y)) to simplify the code. Nevertheless, the topology code needs a major update, so there is an opportunity to make it better.
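
To illustrate the grouping idea only: the macro above is a membership test, which can also be written as a set lookup. The sketch below is in Python for brevity and is not LIKWID's C code; the constants and the group are hypothetical and simply reuse the two model numbers mentioned in this thread, without claiming they belong to the same architecture.

```python
# Hypothetical model IDs for illustration; not an excerpt from topology.h.
MODEL_44H = 0x44
MODEL_75H = 0x75

# Set-based counterpart of:
#   ARCHGROUP(arch) (((arch) == X) || ((arch) == Y))
EXAMPLE_GROUP = frozenset({MODEL_44H, MODEL_75H})


def in_example_group(model: int) -> bool:
    """Return True if the model number belongs to the example group."""
    return model in EXAMPLE_GROUP
```

Adding a model then means adding one entry to the set instead of another comparison in an if statement.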

@tnibler

tnibler commented Oct 10, 2024

https://gist.github.com/tnibler/ffdb00f27dfdfaae4522448934053ae1

In order: CPMC with marker (broken); CPMC without marker (works, but just measures the crash because it was run without marker); PMC with marker (works).
