
ERROR: pthread_setaffinity_np for core 1 failed with code 22 Marking core 1 offline #744

Open
cll24 opened this issue May 15, 2024 · 20 comments


@cll24

cll24 commented May 15, 2024

There are 128 cores in my server, but pcm-pcie reports "the number of online logical cores: 2"; it cannot detect the other 126 cores as online. When I run lscpu on the same server, all 128 cores show as online. How can I fix this problem?
(screenshot of the pcm-pcie output)

@opcm
Contributor

opcm commented May 15, 2024

are you running pcm in a restricted cgroup?

@torshepherd

Hijacking this thread: I'm trying to add pcm to an online monitoring system that runs in a restricted cgroup. Is pcm unable to run when restricted to just some CPUs?

@rdementi
Contributor

pcm-pcie needs all cpus (like in the example above). Other pcm utilities typically don't need all cpus, but pcm won't be able to show per-cpu stats for excluded cpus.

@torshepherd

Thanks for the response.

My use case is using pcm as a library, intermittently reading all the counter states (for all cores, not just the ones pcm's cgroup is limited to) to compute QPI metrics. I can set the cgroup's cpuset to all cores to initialize the PCM instance, but I would ideally restrict it when calling getCounterStates. Does getCounterStates need all CPUs if the initialization had all CPUs?

@rdementi
Contributor

it might or might not work. This scenario is not tested.

@torshepherd

Ok, I've tested moving it to a restricted cgroup after initialization. This produces errors in some metrics, but QPI Utilization seems to work fine in a restricted cgroup.

When I restrict it before initialization, however, I get an exception thrown in discoverSystemTopology (line 1082).

Do we need the topologyDomainMap to get QPI metrics across all sockets & links?

@torshepherd

Seems like there are a couple of places where we pin to core 0:

TemporalThreadAffinity aff0(0);

Could this instead pin to an available core within the cgroup?

@rdementi
Contributor

> Seems like there are a couple of places where we pin to core 0:
>
> TemporalThreadAffinity aff0(0);
>
> Could this instead pin to an available core within the cgroup?

Let me see...
I see just one place. Could you please point to the other?

@torshepherd

The other one is within "readCPUMicrocodeLevel"

@rdementi
Contributor

could you try changing 0 to socketRefCore[0] ?
does it work then?

@torshepherd

I tried hardcoding it to 2, which I know is in the cgroup's cpuset. I added some try/catch blocks around the rest, but I can't get any QPI measurements, unfortunately :/ at least it doesn't crash anymore.

@torshepherd

Basically getting output similar to the following:

ERROR: pthread_setaffinity_np for core 0 failed with code 22
Marking core 0 offline
... repeat for all cores except 2-9
PCM warning: total_sockets_ 1 does not match socket2M2Mbus.size() 2 // I think this is because the cgroup is only on one of the two NUMA nodes
Socket 0: ... 0 UPI ports detected.

But I expect 3 links and 2 sockets, so I guess the topology marking cores offline affects the number of QPI ports?

@rdementi
Contributor

> But I expect 3 links and 2 sockets, so I guess the topology marking cores offline affects the number of QPI ports?

Yes. PCM assumes that on single-socket systems UPI links don't need to be detected, because UPI links only exist to connect 2 or more sockets...

@torshepherd

Ah of course, thanks for pointing that out.

Is the setting of thread affinity necessary to detect that there are multiple sockets?

@rdementi
Contributor

yes. you need at least one core on the other socket to be in the cgroup

@torshepherd

Ok, I fully see the problem now. When pcm is initialized in a restricted cgroup, the try block that populates each topology Entry and fills the socketIdMap fails for cores on the other socket, because affinity cannot be set on cores outside the cgroup. Execution instead falls into the "Marking core offline" catch block, making the system topology inaccurate.

@torshepherd

Aha, and the reason it needs the thread-affinity RAII there is so that reading the apic_id will work? It uses pcm_cpuid, which calls the cpuid instruction, which returns the APIC ID of the current logical processor?

@rdementi
Contributor

correct

@torshepherd

Thanks for the help with this. In cases where cores are inaccessible, I wonder if you could just read /proc/cpuinfo to get the topology instead of running cpuid on each core?

@rdementi
Contributor

Yes, one can experiment with this and see how far we can go in supporting such a config.
