Re: [PATCH v3 6/6] RISC-V: Do not use cpumask data structure for hartid bitmap

From: Ron Economos
Date: Tue Jan 25 2022 - 16:11:47 EST


On 1/25/22 12:52, Geert Uytterhoeven wrote:
Hi Atish,

On Tue, Jan 25, 2022 at 9:17 PM Atish Patra <atishp@xxxxxxxxxxxxxx> wrote:
On Tue, Jan 25, 2022 at 12:12 PM Geert Uytterhoeven
<geert@xxxxxxxxxxxxxx> wrote:
On Thu, Jan 20, 2022 at 10:12 AM Atish Patra <atishp@xxxxxxxxxxxx> wrote:
Currently, SBI APIs accept a hartmask that is generated from struct
cpumask. The cpumask data structure can hold up to NR_CPUS values. Thus, it
is not the correct data structure for hartids, as they can be higher
than NR_CPUS on platforms with sparse or discontiguous hartids.

Remove all association between hartid mask and struct cpumask.
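
To illustrate the point above: SBI calls describe their target harts as a
(hart mask, hart mask base) pair, so hartids far above NR_CPUS can still be
expressed as long as each mask covers one window of BITS_PER_LONG ids. The
following is only a minimal userspace sketch of that idea, not the code from
the patch; struct hart_batch and build_hart_batches() are hypothetical names
made up for this example, and the input is assumed to be sorted.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical illustration type: one (base, mask) pair for an SBI call. */
struct hart_batch {
	unsigned long base;	/* hartid represented by bit 0 of mask */
	unsigned long mask;	/* bit i set => hartid (base + i) is targeted */
};

/*
 * Split a strictly ascending list of hartids into (base, mask) batches.
 * Returns the number of batches written, or -1 if `out` is too small.
 * No NR_CPUS-sized bitmap is involved, so sparse hartids are fine.
 */
static int build_hart_batches(const unsigned long *hartids, size_t n,
			      struct hart_batch *out, size_t max_out)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long id = hartids[i];

		/* This sketch assumes sorted input without duplicates. */
		assert(i == 0 || hartids[i - 1] < id);

		/* Open a new batch when id does not fit the current window. */
		if (used == 0 || id - out[used - 1].base >= BITS_PER_LONG) {
			if (used == max_out)
				return -1;
			out[used].base = id;
			out[used].mask = 0;
			used++;
		}
		out[used - 1].mask |= 1UL << (id - out[used - 1].base);
	}
	return (int)used;
}
```

With hartids {1, 2, 4, 1000, 1001} this yields two batches, (base 1,
mask 0xb) and (base 1000, mask 0x3), each of which would map to one SBI
call in this simplified model.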

Reviewed-by: Anup Patel <anup@xxxxxxxxxxxxxx> (For Linux RISC-V changes)
Acked-by: Anup Patel <anup@xxxxxxxxxxxxxx> (For KVM RISC-V changes)
Signed-off-by: Atish Patra <atishp@xxxxxxxxxxxx>
Thanks for your patch, which is now commit 26fb751ca37846c9 ("RISC-V:
Do not use cpumask data structure for hartid bitmap") in v5.17-rc1.

I am having an issue with random userspace SEGVs on Starlight Beta
(which needs out-of-tree patches). It doesn't always manifest
itself immediately, so it took a while to bisect, but I suspect the
above commit to be the culprit.
I have never seen one before during my testing. How frequently do you see them?
Does it happen while running something specific, or does idle user space
also hit SEGVs randomly?
Sometimes they happen during startup (lots of failures from systemd),
sometimes they happen later, during interactive work.
Sometimes while idle, when something runs in the background (e.g. mandb).

Do you have a trace that I can look into?
# apt update
[ 807.499050] apt[258]: unhandled signal 11 code 0x1 at 0xffffff8300060020 in libapt-pkg.so.6.0.0[3fa49ac000+174000]
[ 807.509548] CPU: 0 PID: 258 Comm: apt Not tainted 5.16.0-starlight-11192-g26fb751ca378-dirty #153
[ 807.518674] Hardware name: BeagleV Starlight Beta (DT)
[ 807.524077] epc : 0000003fa4a47a0a ra : 0000003fa4a479fc sp : 0000003fcb4b39b0
[ 807.531383] gp : 0000002adcef4800 tp : 0000003fa43287b0 t0 : 0000000000000001
[ 807.538603] t1 : 0000000000000009 t2 : 00000000000003ff s0 : 0000000000000000
[ 807.545887] s1 : 0000002adcf3cb60 a0 : 0000000000000003 a1 : 0000000000000000
[ 807.553167] a2 : 0000003fcb4b3a30 a3 : 0000000000000000 a4 : 0000002adcf3cc1c
[ 807.560390] a5 : 0007000300060000 a6 : 0000000000000003 a7 : 1999999999999999
[ 807.567654] s2 : 0000003fcb4b3a28 s3 : 0000000000000002 s4 : 0000003fcb4b3a30
[ 807.575039] s5 : 0000003fa4baa810 s6 : 0000000000000010 s7 : 0000002adcf19a40
[ 807.582363] s8 : 0000003fcb4b4010 s9 : 0000003fa4baa810 s10: 0000003fcb4b3e90
[ 807.589606] s11: 0000003fa4b2a528 t3 : 0000000000000000 t4 : 0000003fa47906a0
[ 807.596891] t5 : 0000000000000005 t6 : ffffffffffffffff
[ 807.602302] status: 0000000200004020 badaddr: ffffff8300060020 cause: 000000000000000d

(-dirty due to Starlight DTS and driver updates)

Gr{oetje,eeting}s,

Geert

--

I'm not sure if it's related, but I'm also seeing a systemd segfault on boot on the HiFive Unmatched with 5.17.0-rc1. I don't have the dmesg output, but here's the journalctl output. The kernel was built before the tag, so it reports 5.16.0.

Jan 23 02:41:50 riscv64 systemd-udevd[551]: mmcblk0p12: Failed to wait for spawned command '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/mmcblk0p12': Invalid argument
Jan 23 02:41:50 riscv64 systemd-udevd[412]: mmcblk0p12: Process '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/mmcblk0p12' terminated by signal SEGV.
Jan 23 02:41:50 riscv64 kernel: systemd-udevd[551]: unhandled signal 11 code 0x1 at 0x0000000003938700 in udevadm[3fa7eee000+b1000]
Jan 23 02:41:50 riscv64 kernel: CPU: 2 PID: 551 Comm: systemd-udevd Not tainted 5.16.0 #1
Jan 23 02:41:50 riscv64 kernel: Hardware name: SiFive HiFive Unmatched A00 (DT)
Jan 23 02:41:50 riscv64 kernel: epc : 0000003fa7f14104 ra : 0000003fa7f14102 sp : 0000003fe3da5320
Jan 23 02:41:50 riscv64 kernel:  gp : 0000003fa7fc3ef8 tp : 0000003fa79f8530 t0 : 0000003fe3da38f0
Jan 23 02:41:50 riscv64 kernel:  t1 : 0000003fa7f0425c t2 : 0000000000000000 s0 : 0000003fcd046d88
Jan 23 02:41:50 riscv64 kernel:  s1 : 0000003fcd046d60 a0 : ffffffffffffffff a1 : 0000003fcd0cb330
Jan 23 02:41:50 riscv64 kernel:  a2 : 0000003fcd043028 a3 : 0000000000000007 a4 : c98b6a1813e46d00
Jan 23 02:41:50 riscv64 kernel:  a5 : ffffffffffffffff a6 : fefefefefefefeff a7 : 0000000000000039
Jan 23 02:41:50 riscv64 kernel:  s2 : 0000000000000000 s3 : ffffffffffffffea s4 : 0000000000000000
Jan 23 02:41:50 riscv64 kernel:  s5 : 0000003fe3da5378 s6 : ffffffffffffffea s7 : 0000000003938700
Jan 23 02:41:50 riscv64 kernel:  s8 : 0000003fe3da53e0 s9 : 0000003fe3da53d8 s10: 0000003fa7fc200c
Jan 23 02:41:50 riscv64 kernel:  s11: 0000000000081000 t3 : 0000003fa7db3822 t4 : 0000000000000000
Jan 23 02:41:50 riscv64 kernel:  t5 : 0000003fe3da38c8 t6 : 000000000000002a
Jan 23 02:41:50 riscv64 kernel: status: 0000000200004020 badaddr: 0000000003938700 cause: 000000000000000d
Jan 23 02:41:50 riscv64 systemd-udevd[412]: mmcblk0p12: Failed to wait for spawned command '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/mmcblk0p12': Input/output error
Jan 23 02:41:50 riscv64 systemd-udevd[412]: mmcblk0p12: Failed to execute '/usr/bin/unshare -m /usr/bin/snap auto-import --mount=/dev/mmcblk0p12', ignoring: Input/output error

I'll try removing this patch.

Ron