Re: [GIT PULL] arm64 fixes for -rc5
From: Will Deacon
Date: Mon Aug 09 2021 - 06:52:47 EST
Hi Linus,
[+Mark]
On Fri, Aug 06, 2021 at 11:40:33AM -0700, Linus Torvalds wrote:
> On Fri, Aug 6, 2021 at 6:53 AM Will Deacon <will@xxxxxxxxxx> wrote:
> >
> > Please pull these arm64 fixes for -rc5. It's all pretty minor but the
> > main fix is sorting out how we deal with return values from 32-bit system
> > calls as audit expects error codes to be sign-extended to 64 bits
>
> I've pulled this, but that change looks _really_ odd.
Cheers, and yes it does. We're stuck in the middle of the architecture,
the compat ABI and internal kernel expectations. More below.
> First you seem to intentionally *zero-extend* the error value when you
> actually set it in pt_regs, and then you sign-extend them when reading
> them.
>
> So the rules seem entirely arbitrary: oen place says "upper 32 bits
> need to be clear" and another place says "upper 32 bits need to be
> sign-extended".
>
> Why this insanity? Why not make the rule be that the upper 32 bits are
> always just sign-extended?
There are a few things which collide here:
The architecture doesn't guarantee that the upper 32-bits of a 64-bit
general purpose register are preserved across an exception return to a
32-bit task. They _might_ be left intact, but it's up to the CPU whether
you get the value you wrote or all zeroes if you read those bits after
taking an exception back to 64-bit state. Consequently, we can't
expose 64-bit registers for 32-bit tasks via ptrace() as the
resulting behaviour is going to vary based on how the hardware feels.
Maybe we could sign-extend everything on exception entry, but that would
necessitate many more syscall wrappers for compat tasks than we currently
have so we could truncate pointer arguments back down to 32 bits.
Instead, we currently handle this by (a) treating the registers of a 32-bit
task as 32 bits (hence the zero extension when writing the value in
syscall_set_return_value()) and (b) explicitly clearing the upper bits of
x0 on exception entry from a 32-bit task in case we previously leaked a
negative syscall return value in there.
The problem then is that some in-kernel users (e.g. audit and some parts of
ptrace which abstract the syscall return value away from the register state)
_do_ want to see sign-extended syscall return arguments in order to match
against error codes (see is_syscall_success()). So we end up in a situation
where we need to sign-extend the return value for those, whilst leaving the
zero-extended version in the actual pt_regs structure.
It's ugly and subtle because the sky doesn't tend to fall in if you get
it wrong. As you can see, we're still fixing it.
Will