SIGBUS when starting with --record #77
Comments
After digging through old rasdaemon logs, I found that the SIGBUS correlates with the following log changes.
Before:
After:
My CPU is an AMD Ryzen Threadripper 3960X, which has 24 cores (48 threads). |
I downgraded my kernel to v5.17 (which I was using months before the problem started) and still experienced the same error. |
I've worked around this problem with the following patch for now, so rasdaemon only listens to online CPU events:
diff --git a/ras-events.c b/ras-events.c
index 39cab20..319f049 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -328,7 +328,7 @@ static void parse_ras_data(struct pthread_data *pdata, struct kbuffer *kbuf,
static int get_num_cpus(struct ras_events *ras)
{
- return sysconf(_SC_NPROCESSORS_CONF);
+ return sysconf(_SC_NPROCESSORS_ONLN);
#if 0
char fname[MAX_PATH + 1];
int num_cpus = 0;
Not sure if it's an acceptable fix to be merged into the repo, but apparently the proper fix is to use libtracefs?
For convenience, the above patch can be applied directly to the binary in a hex editor by:
This is at file offset 0xdb3f for my version of rasdaemon (debian 0.6.7-1+b1).
Or just execute this perl one-liner to do the above:
sudo perl -i -pe 's/\xbf\x53\x00\x00\x00/\xbf\x54\x00\x00\x00/' /sbin/rasdaemon
This effectively makes the following binary patch:
--- rasdaemon.S.before 2022-12-10 19:25:27.114060904 +1100
+++ rasdaemon.S.after 2022-12-10 19:25:33.382269283 +1100
@@ -2670,11 +2670,11 @@
db2f: 41 5a pop %r10
db31: 41 5b pop %r11
db33: 85 c0 test %eax,%eax
db35: 0f 85 e5 05 00 00 jne e120 <__cxa_finalize@plt+0x2a60>
db3b: 83 45 b8 01 addl $0x1,-0x48(%rbp)
- db3f: bf 53 00 00 00 mov $0x53,%edi
+ db3f: bf 54 00 00 00 mov $0x54,%edi
db44: e8 a7 d6 ff ff call b1f0 <sysconf@plt>
db49: 48 89 df mov %rbx,%rdi
db4c: 89 c6 mov %eax,%esi
db4e: 89 45 a8 mov %eax,-0x58(%rbp)
db51: 49 89 c5 mov %rax,%r13 |
My CPU is an "AMD Ryzen 7 2700 Eight-Core Processor". |
Not sure if this is the right fix, as CPUs can be dynamically disabled/enabled at runtime, which would decrease _SC_NPROCESSORS_ONLN. See, if I write this small test.c program:
building it with
It will report 7 online CPUs out of 8 total. Rasdaemon should monitor all 8, as cpu4 can be placed online at any time. With your change, it will not monitor the last CPU. So not only will the disabled CPU go unmonitored, but so will one that is online. The real issue here is: why is AMD announcing more CPUs than it actually has? BIOS issue? |
I'm hitting the same issue on rasdaemon 0.8.0; it seems to be a use-after-free bug. Output from running the daemon under Valgrind is attached here: rasdaemon-0.8.0-crash-valgrind.txt
First invalid access is this:
|
Commit f1ea763 has applied my suggested change so this should be fixed now |
I use Debian Stable, and I also have SIGBUS signals very frequently.
Processor is AMD Ryzen 7 5800X (16). I understand this is now fixed? Need to wait until it lands in my distribution? |
Hi @github12101. Your question is related to Debian. I believe you may want to report what happened to you here https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054152 rather than on this upstream issue tracker. |
- Why I did it
Booting SONiC on an AMD EPYC 16-Core CPU causes rasdaemon to crash. This is not a major blocker because rasdaemon eventually restarts and is stable after a point.
Coredump stack trace:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/rasdaemon -f -r'.
Program terminated with signal SIGBUS, Bus error.
#0 0x00007f74f62af7f4 in sqlite3_finalize () from /lib/x86_64-linux-gnu/libsqlite3.so.0
[Current thread is 1 (Thread 0x7f73c8ff96c0 (LWP 17416))]
Known issue for rasdaemon: mchehab/rasdaemon#77
Fixed here: mchehab/rasdaemon@f1ea763
Unfortunately this fix is not present in the default bookworm version, so I backported the fix and compiled rasdaemon from source.
Here is the patch: https://sources.debian.org/patches/rasdaemon/0.8.0-2/0001-Check-CPUs-online-not-configured.patch/
- How I did it
- How to verify it
Booted the image built with these changes and no issue is observed.
Signed-off-by: Vivek Reddy <vkarri@nvidia.com>
Since linux v6.0.8 (which contains the fix for issue #73), rasdaemon will crash with SIGBUS when launched with the --record option.
Debugging under gdb (with an extra --foreground option) shows it seems to be from sqlite code:
At present I haven't investigated any further, so I don't know if it's actually a sqlite problem or a rasdaemon problem yet.