riscv: boot failure for 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
From: Drew Fustini
Date: Sat May 20 2023 - 22:13:49 EST
Hello, I tested 6.4-rc1 on an internal RISC-V SoC and observed a boot
failure on a Store/AMO access fault (exception code 7) in __memset().
stval (e.g. badaddr) was set to 0xffffaf8000000000. This SoC is RV64GC
with Sv48 so it seems that address is the start of the "direct mapping
of all physical memory" [1].
The 6.3 release boots okay and the system is able to operate correctly
with an Ubuntu 23.04 rootfs on eMMC. Therefore, I decided to bisect and
I found the failure begins with 3335068f8721 ("riscv: Use PUD/P4D/PGD
pages for the linear mapping"). The system boots okay with the prior
commit 8589e346bbb6 ("riscv: Move the linear mapping creation in its
own function").
The boot log [2] shows that the fault happens right after buildroot's
init script [3] uses switch_root to execute init from the Ubuntu rootfs
on the eMMC.
DWARF4 is enabled in .config [4] and the decoded stack trace [5] shows:
epc : __memset (/eng/dfustini/gitlab/linux/arch/riscv/lib/memset.S:67)
>From memset.S:
Line 67: REG_S a1, 0(t0)
>From the oops:
epc : ffffffff81122d6c ra : ffffffff80218504 sp : ffffaf8002e47500
gp : ffffffff82695010 tp : ffffaf8002e2ec00 t0 : ffffaf8000000000
t1 : 0000000000000080 t2 : 0000000000000001 s0 : ffffaf8002e47550
s1 : ffff8d8200000040 a0 : ffffaf8000000000 a1 : 0000000000000000
Thus I think it is trying to store 0x0 to 0xffffaf8000000000 which is
the start of the direct map. From the boot log [2], OpenSBI shows:
Domain0 Region00 : 0x0000000002080000-0x00000000020bffff M: (I,R,W) S/U: ()
Domain0 Region01 : 0x0000008000000000-0x000000800003ffff M: (R,W,X) S/U: ()
Domain0 Region02 : 0x0000000002000000-0x000000000207ffff M: (I,R,W) S/U: ()
Domain0 Region03 : 0x0000000000000000-0xffffffffffffffff M: (R,W,X) S/U: (R,W,X)
The DDR memory on this SoC starts at 0x8000000000 with size 2GB. The
memory node from the device tree [6]:
memory@8000000000 {
device_type = "memory";
reg = <0x80 0 0x00000000 0x80000000>;
};
I think the direct map address 0xffffaf8000000000 would map to physical
address 0x8000000000. Thus I think the attempted store in S-mode to that
address would violate the PMP settings for Region01.
I do not yet understand why this happens with 3335068f8721 ("riscv: Use
PUD/P4D/PGD pages for the linear mapping") but not for the prior commit
8589e346bbb6 ("riscv: Move the linear mapping creation in its own
function").
One important cavaet: I do have a small diff from mainline to add
support for the eMMC controller in this SoC to sdhci-of-dwcmshc.c. The
output of 'git diff' when 3335068f8721 is checked out [7] shows that
this just adds a new compatible and corresponding sdhci_ops struct.
Everything works ok with this change in both the 6.3 release and the
commit prior to 3335068f8721.
I know it is a bit awkward for me to report a boot failure for an
internal SoC but I am hoping to find a better solution than just
reverting this change in the downstream kernel.
The reason that so few changes are needed to run Linux on this SoC is
that there is a service processor that handles all the low-level tasks
like setting up clocks and configuring various peripheral controllers.
Everything is already setup and ready to go by the time the hart meant
to run OpenSBI+Linux (fw_payload.bin) comes out of reset.
Note: normally Linux runs on all four harts but I reduced to running on
a single hart to simplify diagnosing this boot failure.
Thanks,
Drew
[1] https://docs.kernel.org/riscv/vm-layout.html#risc-v-linux-kernel-sv48
[2] boot log: https://gist.github.com/pdp7/afe78604f477c9e3a3cf0241bcdffcdb
[3] init script: https://gist.github.com/pdp7/8d61bafbca55e987b790433c0353831d
[4] linux .config: https://gist.github.com/pdp7/a4df66f1359a34194bddd32f74ab38a3
[5] stacktrace: https://gist.github.com/pdp7/0524892ea319775ea70e43a54cc842a9
[6] mysoc.dts: https://gist.github.com/pdp7/cd1b2e8e8d3f6047efd53e4ef65664da
[7] git diff: https://gist.github.com/pdp7/581c9e8415da94a29d34ae6d7cc14669