Re: clock_gettime64 vdso bug on 32-bit arm, rpi-4
From: Arnd Bergmann
Date: Wed May 20 2020 - 16:52:43 EST
On Wed, May 20, 2020 at 7:09 PM Rich Felker <dalias@xxxxxxxx> wrote:
>
> On Wed, May 20, 2020 at 12:08:10PM -0400, Rich Felker wrote:
> > On Wed, May 20, 2020 at 04:41:29PM +0100, Szabolcs Nagy wrote:
> > > The 05/19/2020 22:31, Arnd Bergmann wrote:
> > > > On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella
> > > > <adhemerval.zanella@xxxxxxxxxx> wrote:
> > > > > On 19/05/2020 16:54, Arnd Bergmann wrote:
> > > note: i could not reproduce it in qemu-system with these configs:
> > >
> > > qemu-system-aarch64 + arm64 kernel + compat vdso
> > > qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
> > > qemu-system-arm + cpu max + 32bit arm kernel
> > >
> > > so i think it's something specific to that user's setup
> > > (maybe rpi hw bug or gcc miscompiled the vdso or something
> > > with that particular linux, i built my own linux 5.6 because
> > > i did not know the exact kernel version where the bug was seen)
> > >
> > > i don't have access to rpi (or other cortex-a53 where i
> > > can install my own kernel) so this is as far as i got.
> >
> > If we have a binary of the kernel that's known to be failing on the
> > hardware, it would be useful to dump its vdso and examine the
> > disassembly to see if it was miscompiled.
>
> OK, OP posted it and I think we've solved this. See
> https://github.com/richfelker/musl-cross-make/issues/96#issuecomment-631604410
Thanks a lot everyone for figuring this out.
> And my analysis:
>
> <@dalias> see what i just found on the tracker
> <@dalias> patch_vdso/vdso_nullpatch_one in arch/arm/kernel/vdso.c patches out the time32 functions in this case
> <@dalias> but not the time64 one
> <@dalias> this looks like a real kernel bug that's not hw-specific except breaking on all hardware where the patching-out is needed
> <@dalias> we could possibly work around it by refusing to use the time64 vdso unless the time32 one is also present
> <@dalias> yep
> <@dalias> so i think we've solved this. the kernel thought it wasnt using vdso anymore because it patched it out
> <@dalias> but it forgot to patch out the time64 one
> <@dalias> so it stopped updating the data needed for vdso to work
As you mentioned in the issue tracker, the patching was meant as
an optimization, and missing it for clock_gettime64 was a mistake, but
that by itself should not have caused incorrect data to be returned.
I would assume that there is another bug that leads to clock_gettime64
not entering the syscall fallback path as it should but instead returning
bogus data.
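One way to see whether the vdso path and the syscall path disagree on a
given machine is to read the same clock both ways and compare. This is
only a minimal sketch: the helper name and the struct are made up for
illustration, and it assumes a Linux libc that exposes the legacy
clock_gettime syscall number (the fallback #define is a guess at how
some libcs spell it on 32-bit time64 targets).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Assumption: some libcs may expose the legacy call only under this name. */
#if !defined(SYS_clock_gettime) && defined(SYS_clock_gettime32)
#define SYS_clock_gettime SYS_clock_gettime32
#endif

/* Layout taken by the legacy clock_gettime syscall: two native longs. */
struct legacy_timespec {
    long tv_sec;
    long tv_nsec;
};

/*
 * Hypothetical helper: read `clk` once through libc (which may use the
 * vdso) and once through the raw syscall, then store "syscall minus libc"
 * in nanoseconds. Returns 0 on success, -1 if either read failed.
 */
int vdso_syscall_delta_ns(clockid_t clk, int64_t *delta_ns)
{
    struct timespec libc_ts;
    struct legacy_timespec sys_ts;

    assert(delta_ns != NULL);

    if (clock_gettime(clk, &libc_ts) != 0)
        return -1;
    if (syscall(SYS_clock_gettime, clk, &sys_ts) != 0)
        return -1;

    *delta_ns = ((int64_t)sys_ts.tv_sec - (int64_t)libc_ts.tv_sec) * 1000000000LL
              + ((int64_t)sys_ts.tv_nsec - (int64_t)libc_ts.tv_nsec);
    return 0;
}
```

On a healthy system the delta for CLOCK_MONOTONIC should be a small
positive number; a large or wildly varying value would point at the vdso
returning stale data instead of falling back.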
Here are some more things I found:
- From reading the linux-5.6 code that was tested, I see that a condition
that leads to patching out the clock_gettime() vdso should also lead to
clock_gettime64() falling back to the syscall after
__arch_get_hw_counter() returns an error, but for some reason that
does not happen. Presumably the presence of the patching meant that
this code path was never much exercised.
A missing commit 45939ce292b4 ("ARM: 8957/1: VDSO: Match ARMv8 timer in
cntvct_functional()") would explain the problem if it happened on
linux-5.6-rc7 or earlier. The fix was merged in the final v5.6 though.
- The patching may actually be counterproductive because it means that
clock_gettime(CLOCK_*COARSE, ...) has to go through the system call
when it could just return the time of the last timer tick regardless of the
clocksource.
- We may get bitten by errata handling on 32-bit kernels running on 64-bit
hardware that has errata workarounds in arch/arm64 for compat mode
but not in native arm kernels. ARM64_ERRATUM_1418040,
ARM64_ERRATUM_858921 or SUN50I_ERRATUM_UNKNOWN1
are examples of workarounds that are not used on 32-bit kernels running
on 64-bit hardware.
Arnd