Re: riscv+KASAN does not boot

From: Dmitry Vyukov
Date: Thu Feb 18 2021 - 08:35:17 EST


On Thu, Feb 18, 2021 at 8:54 AM Alex Ghiti <alex@xxxxxxxx> wrote:
>
> Hi Dmitry,
>
> > On Wed, Feb 17, 2021 at 5:36 PM Alex Ghiti <alex@xxxxxxxx> wrote:
> >>
> >> On 2/16/21 at 11:42 PM, Dmitry Vyukov wrote:
> >>> On Tue, Feb 16, 2021 at 9:42 PM Alex Ghiti <alex@xxxxxxxx> wrote:
> >>>>
> >>>> Hi Dmitry,
> >>>>
> >>>> On 2/16/21 at 6:25 AM, Dmitry Vyukov wrote:
> >>>>> On Tue, Feb 16, 2021 at 12:17 PM Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> On Fri, Jan 29, 2021 at 9:11 AM Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
> >>>>>>>> I was fixing KASAN support for my sv48 patchset so I took a look at your
> >>>>>>>> issue: I built a kernel on top of the branch riscv/fixes using
> >>>>>>>> https://github.com/google/syzkaller/blob/269d24e857a757d09a898086a2fa6fa5d827c3e1/dashboard/config/linux/upstream-riscv64-kasan.config
> >>>>>>>> and Buildroot 2020.11. I have the warnings regarding the use of
> >>>>>>>> __virt_to_phys on wrong addresses (but that's normal since this function
> >>>>>>>> is used in virt_addr_valid) but not the segfaults you describe.
> >>>>>>>
> >>>>>>> Hi Alex,
> >>>>>>>
> >>>>>>> Let me try to rebuild the buildroot image. Maybe there was something
> >>>>>>> wrong with my build, though I did 'make clean' before building. But at
> >>>>>>> the same time it worked back in June...
> >>>>>>>
> >>>>>>> Re WARNINGs, they indicate kernel bugs. I am working on setting up a
> >>>>>>> syzbot instance on riscv. If there is a WARNING during boot then the
> >>>>>>> kernel will be marked as broken. No further testing will happen.
> >>>>>>> Is it a mis-use of WARN_ON? If so, could anybody please remove it or
> >>>>>>> replace it with pr_err.
> >>>>>>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I've localized one issue with riscv/KASAN:
> >>>>>> KASAN breaks VDSO, and I think that's the root cause of the weird faults
> >>>>>> I saw earlier. The following patch fixes it.
> >>>>>> Could somebody please upstream this fix? I don't know how to add/run
> >>>>>> tests for this.
> >>>>>> Thanks
> >>>>>>
> >>>>>> diff --git a/arch/riscv/kernel/vdso/Makefile b/arch/riscv/kernel/vdso/Makefile
> >>>>>> index 0cfd6da784f84..cf3a383c1799d 100644
> >>>>>> --- a/arch/riscv/kernel/vdso/Makefile
> >>>>>> +++ b/arch/riscv/kernel/vdso/Makefile
> >>>>>> @@ -35,6 +35,7 @@ CFLAGS_REMOVE_vgettimeofday.o = $(CC_FLAGS_FTRACE) -Os
> >>>>>> # Disable gcov profiling for VDSO code
> >>>>>> GCOV_PROFILE := n
> >>>>>> KCOV_INSTRUMENT := n
> >>>>>> +KASAN_SANITIZE := n
> >>>>>>
> >>>>>> # Force dependency
> >>>>>> $(obj)/vdso.o: $(obj)/vdso.so
> >>>>
> >>>> What's weird is that I don't have any issue without this patch with the
> >>>> following config whereas it indeed seems required for KASAN. But when
> >>>> looking at the segfaults you got earlier, the segfault address is 0xbb0
> >>>> and the cause is an instruction page fault: this address is the PLT base
> >>>> address in vdso.so and an instruction page fault would mean that someone
> >>>> tried to jump at this address, which is weird. At first sight, that does
> >>>> not seem related to your patch above, but clearly I may be wrong.
> >>>>
> >>>> Tobias, did you observe the same segfaults as Dmitry ?
> >>>
> >>>
> >>> I noticed that not all buildroot images use VDSO, it seems to be
> >>> dependent on libc settings (at least I think I changed it in the
> >>> past).
> >>
> >> Ok, I used uClibc but then when using glibc, I have the same segfaults,
> >> only when KASAN is enabled. And your patch fixes the problem. I will try
> >> to take a look later to better understand the problem.
> >>
> >>> I also booted an image completely successfully including dhcpd/sshd
> >>> start, but then my executable crashed in clock_gettime. The executable
> >>> was built on a linux/amd64 host with "riscv64-linux-gnu-gcc -static"
> >>> (10.2.1).
> >>>
> >>>
> >>>>> Second issue I am seeing seems to be related to text segment size.
> >>>>> I check out v5.11 and use this config:
> >>>>> https://gist.github.com/dvyukov/6af25474d455437577a84213b0cc9178
> >>>>
> >>>> This config gave my laptop a hard time ! Finally I was able to boot
> >>>> correctly to userspace, but I realized I used my sv48 branch...Either I
> >>>> fixed your issue along the way or I can't reproduce it, I'll give it a
> >>>> try tomorrow.
> >>>
> >>> Where is your branch? I could also test in my setup on your branch.
> >>>
> >>
> >> You can find my branch int/alex/riscv_kernel_end_of_address_space_v2
> >> here: https://github.com/AlexGhiti/riscv-linux.git
> >
> > No, it does not work for me.
> >
> > Source is on b61ab6c98de021398cd7734ea5fc3655e51e70f2 (HEAD,
> > int/alex/riscv_kernel_end_of_address_space_v2)
> > Config is https://gist.githubusercontent.com/dvyukov/6af25474d455437577a84213b0cc9178/raw/55b116522c14a8a98a7626d76df740d54f648ce5/gistfile1.txt
> >
> > riscv64-linux-gnu-gcc -v
> > gcc version 10.2.1 20210110 (Debian 10.2.1-6+build1)
> >
> > qemu-system-riscv64 --version
> > QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-3)
> >
> > qemu-system-riscv64 \
> > -machine virt -smp 2 -m 2G \
> > -device virtio-blk-device,drive=hd0 \
> > -drive file=image-riscv64,if=none,format=raw,id=hd0 \
> > -kernel arch/riscv/boot/Image \
> > -nographic \
> > -device virtio-rng-device,rng=rng0 \
> > -object rng-random,filename=/dev/urandom,id=rng0 \
> > -netdev user,id=net0,host=10.0.2.10,hostfwd=tcp::10022-:22 \
> > -device virtio-net-device,netdev=net0 \
> > -append "root=/dev/vda earlyprintk=serial console=ttyS0 oops=panic panic_on_warn=1 panic=86400 earlycon"
>
> It still works for me, but I had to disable CONFIG_DEBUG_INFO_BTF (I
> don't think that changes anything at runtime). Your command line above
> does not work for me, though, as it appears you do not load any firmware;
> if I add -bios images/fw_jump.elf, it works. But then I don't know where
> your opensbi output below comes from...
>
> And regarding your issue with calling clock_gettime 'directly' compared
> to using the syscall, I have the same consistent output from both calls.
>
> I have an older gcc (9.3.0) and the same qemu. I think what is missing
> here is your buildroot config, so that we have the exact same
> environment: could you post your buildroot config as well ?

I don't think the image is relevant because I don't even get to kernel
code. If the kernel complains about a missing init later, that's fine.
Re bios, this version of qemu already has an OpenSBI bios built in; you
can pass -bios default, but that's, well, the default :)
Here are more precise repro instructions that pin the gcc and qemu
versions. I think the gcc version may be relevant, as I suspect
code size.


curl https://gist.githubusercontent.com/dvyukov/6af25474d455437577a84213b0cc9178/raw/55b116522c14a8a98a7626d76df740d54f648ce5/gistfile1.txt > $KERNEL_SRC/.config
docker pull gcr.io/syzkaller/syzbot
docker run -it -v $KERNEL_SRC:/kernel gcr.io/syzkaller/syzbot
cd /kernel
make -j72 ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- olddefconfig
make -j72 ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu-
qemu-system-riscv64 -machine virt -smp 2 -m 4G \
  -kernel arch/riscv/boot/Image -nographic \
  -append "earlycon earlyprintk=serial console=ttyS0"
[this does not boot; I see only the OpenSBI output]

scripts/config -d KASAN_INLINE -e KASAN_OUTLINE \
  -d CC_OPTIMIZE_FOR_PERFORMANCE -e CC_OPTIMIZE_FOR_SIZE
make -j72 ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu-
qemu-system-riscv64 -machine virt -smp 2 -m 4G \
  -kernel arch/riscv/boot/Image -nographic \
  -append "earlycon earlyprintk=serial console=ttyS0"
[this boots fine, at least up to starting the init process]