SIGBUS when starting with --record #77

Closed
kotarou3 opened this issue Nov 24, 2022 · 9 comments

@kotarou3

Since linux v6.0.8 (which contains the fix for issue #73), rasdaemon will crash with SIGBUS when launched with the --record option.

Debugging under gdb (with an extra --foreground option) shows it seems to be from sqlite code:

Thread 72 "rasdaemon" received signal SIGBUS, Bus error.
[Switching to Thread 0x7ffebffff6c0 (LWP 7972)]
0x00007ffff7f2d684 in sqlite3_finalize () from /lib/x86_64-linux-gnu/libsqlite3.so.0
(gdb) bt
#0  0x00007ffff7f2d684 in sqlite3_finalize () from /lib/x86_64-linux-gnu/libsqlite3.so.0
#1  0x00005555555644d0 in ?? ()
#2  0x0000555555561738 in ?? ()
#3  0x00007ffff7ce2fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007ffff7d6366c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

I haven't investigated any further yet, so I don't know whether this is a sqlite problem or a rasdaemon problem.

@kotarou3
Author

kotarou3 commented Dec 10, 2022

After digging through old rasdaemon logs, I found that the SIGBUS correlates with the following log changes

Before:

rasdaemon: Listening to events for cpus 0 to 47

After:

rasdaemon: Listening to events for cpus 0 to 127
rasdaemon: Error on CPU 48
rasdaemon: Error on CPU 49
...
rasdaemon: Error on CPU 126
rasdaemon: Error on CPU 127
rasdaemon: Old kernel detected. Stop listening and fall back to pthread way.

My CPU is an AMD Ryzen Threadripper 3960X, which has 24 cores (48 threads).
The dates in the logs line up with when I upgraded from Linux v5.19 to v6.0.
So at some point between those versions the kernel started reporting more CPUs than actually exist, and rasdaemon is unable to handle that properly.

@kotarou3
Author

I downgraded my kernel to v5.17 (which I was using months before the problem started) and still experienced the same error.
So I think it's not directly caused by the kernel, but might be firmware related?

@kotarou3
Author

I've worked around this problem with the following patch for now, so rasdaemon only listens to online CPU events:

diff --git a/ras-events.c b/ras-events.c
index 39cab20..319f049 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -328,7 +328,7 @@ static void parse_ras_data(struct pthread_data *pdata, struct kbuffer *kbuf,
 
 static int get_num_cpus(struct ras_events *ras)
 {
-       return sysconf(_SC_NPROCESSORS_CONF);
+       return sysconf(_SC_NPROCESSORS_ONLN);
 #if 0
        char fname[MAX_PATH + 1];
        int num_cpus = 0;

Not sure if it's an acceptable fix to be merged into the repo, but apparently the proper fix is to use libtracefs?


For convenience, the above patch can be applied directly to the binary in a hex editor by:

  1. Searching for bf 53 00 00 00
  2. Replacing it with bf 54 00 00 00

This is at file offset 0xdb3f for my version of rasdaemon (Debian 0.6.7-1+b1).

Or just execute this perl one-liner to do the above:

sudo perl -i -pe 's/\xbf\x53\x00\x00\x00/\xbf\x54\x00\x00\x00/' /sbin/rasdaemon

This effectively makes the following binary patch:

--- rasdaemon.S.before	2022-12-10 19:25:27.114060904 +1100
+++ rasdaemon.S.after	2022-12-10 19:25:33.382269283 +1100
@@ -2670,11 +2670,11 @@
     db2f:	41 5a                	pop    %r10
     db31:	41 5b                	pop    %r11
     db33:	85 c0                	test   %eax,%eax
     db35:	0f 85 e5 05 00 00    	jne    e120 <__cxa_finalize@plt+0x2a60>
     db3b:	83 45 b8 01          	addl   $0x1,-0x48(%rbp)
-    db3f:	bf 53 00 00 00       	mov    $0x53,%edi
+    db3f:	bf 54 00 00 00       	mov    $0x54,%edi
     db44:	e8 a7 d6 ff ff       	call   b1f0 <sysconf@plt>
     db49:	48 89 df             	mov    %rbx,%rdi
     db4c:	89 c6                	mov    %eax,%esi
     db4e:	89 45 a8             	mov    %eax,-0x58(%rbp)
     db51:	49 89 c5             	mov    %rax,%r13
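
As a side note on why a one-byte change is enough: the `mov $0x53,%edi` instruction loads the first argument of `sysconf()`, so the patch only swaps one selector constant for another. The sketch below prints the two selectors as the local C library defines them. The function name `print_cpu_selectors` is made up for illustration; on the glibc build shown in the disassembly the values are 0x53 and 0x54, but other C libraries may number them differently, so check before patching a binary.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of rasdaemon: print the integer values that
 * the two sysconf() selectors expand to on this C library, and return the
 * difference between them.
 */
int print_cpu_selectors(FILE *out)
{
	fprintf(out, "_SC_NPROCESSORS_CONF = 0x%x\n",
		(unsigned int)_SC_NPROCESSORS_CONF);
	fprintf(out, "_SC_NPROCESSORS_ONLN = 0x%x\n",
		(unsigned int)_SC_NPROCESSORS_ONLN);

	return (int)_SC_NPROCESSORS_ONLN - (int)_SC_NPROCESSORS_CONF;
}
```

Calling `print_cpu_selectors(stdout)` from a small `main()` is enough to confirm the byte values before touching the binary.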

@hamelg

hamelg commented Jan 7, 2023

My CPU is an "AMD Ryzen 7 2700 Eight-Core Processor".
I confirm this issue and the fix.
Thanks :)

@mchehab
Owner

mchehab commented Jan 21, 2023

Not sure if this is the right fix, as CPUs can be dynamically disabled/enabled at runtime, which can decrease _SC_NPROCESSORS_ONLN. See, if I write this small test.c program:

#include <stdio.h>
#include <unistd.h>
int main(void)
{
	printf ("Number of cpus: %ld\n", sysconf(_SC_NPROCESSORS_CONF));
	printf ("Number of online cpus: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));

	return 0;
}

building it with gcc -o test test.c and then doing:

# grep . /sys/devices/system/cpu/online /sys/devices/system/cpu/offline
/sys/devices/system/cpu/online:0-7
# echo 0 > /sys/devices/system/cpu/cpu4/online
# echo 0 > /sys/devices/system/cpu/cpu4/online
# grep . /sys/devices/system/cpu/online /sys/devices/system/cpu/offline
/sys/devices/system/cpu/online:0-3,5-7
/sys/devices/system/cpu/offline:4
$ ./test
Number of cpus: 8
Number of online cpus: 7

It will report 7 online CPUs out of 8 total. Rasdaemon should monitor all 8, as cpu4 can be brought online at any time. With your change, it will only listen on CPUs 0 to 6, so it will not monitor the last CPU. So not only will the disabled CPU go unmonitored, but so will one that is online.
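
To make the difference concrete, here is a minimal sketch that parses a sysfs CPU list such as `0-3,5-7` and reports both how many CPUs it names and the highest id plus one. The helper name `parse_cpu_list` is an assumption for illustration and is not rasdaemon code; it only shows that a count of online CPUs (7) is not the same thing as the range of CPU ids that would need to be covered (0 to 7).

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of rasdaemon: parse a sysfs-style CPU list
 * ("0-3,5-7", optionally newline-terminated).
 *   *count receives the number of CPUs named in the list
 *   *span  receives the highest CPU id plus one (0 for an empty list)
 * Returns 0 on success, -1 on malformed input.
 */
int parse_cpu_list(const char *list, int *count, int *span)
{
	const char *p = list;
	long highest = -1;
	long total = 0;

	assert(list != NULL && count != NULL && span != NULL);

	while (*p != '\0' && *p != '\n') {
		char *end;
		long first, last;

		first = strtol(p, &end, 10);
		if (end == p || first < 0 || first > 1000000)
			return -1;
		last = first;
		p = end;

		/* An optional "-N" turns a single id into a range */
		if (*p == '-') {
			p++;
			last = strtol(p, &end, 10);
			if (end == p || last < first || last > 1000000)
				return -1;
			p = end;
		}

		total += last - first + 1;
		if (last > highest)
			highest = last;

		/* Entries are separated by commas */
		if (*p == ',')
			p++;
		else if (*p != '\0' && *p != '\n')
			return -1;
	}

	*count = (int)total;
	*span = (int)(highest + 1);
	return 0;
}
```

For the `online` file shown above, this gives a count of 7 but a span of 8, which is why sizing per-CPU listeners from the online count alone skips cpu7.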

The real issue here is: why is AMD announcing more CPUs than it actually has? Is it a BIOS issue?

@vidraj

vidraj commented Mar 25, 2023

I'm hitting the same issue on rasdaemon 0.8.0; it seems to be a use-after-free bug. Output from running the daemon under Valgrind is attached here: rasdaemon-0.8.0-crash-valgrind.txt

The first invalid access is this:

==25802== Invalid read of size 8
==25802==    at 0x11C906: ras_mc_event_closedb (ras-record.c:918)
==25802==    by 0x117DB7: handle_ras_events_cpu (ras-events.c:640)
==25802==    by 0x4A8D389: start_thread (pthread_create.c:442)
==25802==    by 0x4B0D5BF: clone (clone.S:100)
==25802==  Address 0x17653f00 is 0 bytes inside a block of size 72 free'd
==25802==    at 0x484440F: free (vg_replace_malloc.c:884)
==25802==    by 0x11C9FC: ras_mc_event_closedb (ras-record.c:1020)
==25802==    by 0x117DB7: handle_ras_events_cpu (ras-events.c:640)
==25802==    by 0x4A8D389: start_thread (pthread_create.c:442)
==25802==    by 0x4B0D5BF: clone (clone.S:100)
==25802==  Block was alloc'd at
==25802==    at 0x4846C0F: calloc (vg_replace_malloc.c:1340)
==25802==    by 0x11C50B: ras_mc_event_opendb (ras-record.c:768)
==25802==    by 0x117D37: handle_ras_events_cpu (ras-events.c:628)
==25802==    by 0x4A8D389: start_thread (pthread_create.c:442)
==25802==    by 0x4B0D5BF: clone (clone.S:100)
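
The trace is consistent with the close routine running more than once on the same allocation: the block is freed at ras-record.c:1020 and then read again at ras-record.c:918. Below is a minimal sketch of that pattern and one common guard against it (clearing the pointer before freeing and treating a second close as a no-op). All names here (`struct events`, `struct db_priv`, `events_opendb`, `events_closedb`) are hypothetical stand-ins, not rasdaemon's real types, and the sketch is single-threaded; if several per-CPU threads share the state, a lock or reference count would also be needed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the daemon's database state. */
struct db_priv {
	int open_statements;
};

struct events {
	struct db_priv *db_priv;
};

/* Allocate the private state once; a second open is a no-op. */
int events_opendb(struct events *ev)
{
	assert(ev != NULL);

	if (ev->db_priv)
		return 0;

	ev->db_priv = calloc(1, sizeof(*ev->db_priv));
	return ev->db_priv ? 0 : -1;
}

/*
 * Release the private state. The pointer is cleared before free() so that
 * a repeated call sees NULL and returns early instead of touching freed
 * memory. Returns 0 if something was closed, -1 if it was already closed.
 */
int events_closedb(struct events *ev)
{
	struct db_priv *priv;

	assert(ev != NULL);

	priv = ev->db_priv;
	if (!priv)
		return -1;

	ev->db_priv = NULL;
	free(priv);
	return 0;
}
```

With this guard a second `events_closedb()` call returns -1 rather than reading the freed block.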

@kotarou3
Author

Commit f1ea763 has applied my suggested change, so this should be fixed now.

@github12101

I use Debian Stable, and I also get SIGBUS crashes very frequently.

$ sudo coredumpctl list --no-pager | grep rasdaemon | tail -n 25
[sudo] password for pioruns: 
Sun 2023-11-12 12:38:48 GMT 4016656    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Mon 2023-11-13 08:51:57 GMT 4160630    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Tue 2023-11-14 03:53:52 GMT  402897    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Wed 2023-11-15 10:23:00 GMT 1403843    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Thu 2023-11-16 09:41:19 GMT 1403987    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Thu 2023-11-16 16:43:21 GMT     975    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Fri 2023-11-17 23:53:52 GMT 1536919    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Fri 2023-11-17 23:53:53 GMT 2517616    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Sun 2023-11-19 10:06:24 GMT 2517665    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Sun 2023-11-19 10:06:25 GMT 3668922    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Sun 2023-11-19 10:06:25 GMT 3668988    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Tue 2023-11-21 03:33:34 GMT     936    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Wed 2023-11-22 10:20:02 GMT  220536    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Wed 2023-11-22 10:20:03 GMT 3081318    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Wed 2023-11-22 10:20:04 GMT 3083389    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Thu 2023-11-23 10:50:47 GMT 3085716    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Fri 2023-11-24 07:13:39 GMT 1140728    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Sat 2023-11-25 10:04:15 GMT  389719    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Sun 2023-11-26 07:17:43 GMT  506670    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Mon 2023-11-27 10:24:38 GMT 1766908    0    0 SIGBUS  missing  /usr/sbin/rasdaemon                                                                                -
Sat 2023-12-09 10:38:51 GMT 2058458    0    0 SIGBUS  present  /usr/sbin/rasdaemon                                                                           272.5K
Sun 2023-12-10 12:05:05 GMT 2058605    0    0 SIGBUS  present  /usr/sbin/rasdaemon                                                                           272.4K
Mon 2023-12-11 07:55:29 GMT 3163873    0    0 SIGBUS  present  /usr/sbin/rasdaemon                                                                           273.0K
Mon 2023-12-11 08:04:42 GMT 3195740    0    0 SIGBUS  present  /usr/sbin/rasdaemon                                                                           270.4K
Mon 2023-12-11 10:06:22 GMT     950    0    0 SIGBUS  present  /usr/sbin/rasdaemon                                                                           268.5K

The processor is an AMD Ryzen 7 5800X (16). I understand this is now fixed? Do I need to wait until it lands in my distribution?

@tai271828

Hi @github12101. Your question is specific to Debian. I believe you may want to report what happened to you at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054152 rather than in this upstream issue tracker.

liat-grozovik pushed a commit to sonic-net/sonic-buildimage that referenced this issue May 22, 2024
- Why I did it
Booting SONiC on an AMD EPYC 16-Core CPU causes rasdaemon to crash. This is not a major blocker because rasdaemon eventually restarts and becomes stable after a while.

Coredump stack trace:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/rasdaemon -f -r'.
Program terminated with signal SIGBUS, Bus error.
#0  0x00007f74f62af7f4 in sqlite3_finalize () from /lib/x86_64-linux-gnu/libsqlite3.so.0
[Current thread is 1 (Thread 0x7f73c8ff96c0 (LWP 17416))]
Known issue for rasdaemon: mchehab/rasdaemon#77

Fixed here:
mchehab/rasdaemon@f1ea763

Unfortunately, this fix is not present in the default bookworm version, so I backported the fix and compiled rasdaemon from source.

Here is the patch: https://sources.debian.org/patches/rasdaemon/0.8.0-2/0001-Check-CPUs-online-not-configured.patch/

- How I did it

- How to verify it
Booted the image built with these changes and no issue is observed.

Signed-off-by: Vivek Reddy <vkarri@nvidia.com>