Re: in_compat_syscall() on x86

From: Eric W. Biederman
Date: Mon Jan 04 2021 - 19:49:18 EST


Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:

>> On Jan 4, 2021, at 2:36 PM, David Laight <David.Laight@xxxxxxxxxx> wrote:
>>
>> From: Eric W. Biederman
>>> Sent: 04 January 2021 20:41
>>>
>>> Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:
>>>
>>>> On Mon, Jan 04, 2021 at 12:16:56PM +0000, David Laight wrote:
>>>>> On x86 in_compat_syscall() is defined as:
>>>>> in_ia32_syscall() || in_x32_syscall()
>>>>>
>>>>> Now in_ia32_syscall() is a simple check of the TS_COMPAT flag.
>>>>> However in_x32_syscall() is a horrid beast that has to indirect
>>>>> through to the original %eax value (ie the syscall number) and
>>>>> check for a bit there.
>>>>>
>>>>> So on a kernel with x32 support (probably most distro kernels)
>>>>> the in_compat_syscall() check is rather more expensive than
>>>>> one might expect.
>>>
>>> I suggest you check the distro kernels. I suspect they don't compile in
>>> support for x32. As far as I can tell x32 is an undead beast of a
>>> subarchitecture that just enough people use that it can't be removed,
>>> but few enough people use that it likely has a few lurking scary bugs.
>>
>> It is defined in the Ubuntu kernel configs I've got lurking:
>> Both 3.8.0-19_generic (Ubuntu 13.04) and 5.4.0-56_generic (probably 20.04).
>> Which is probably why it is in my test builds (I've just cut out
>> a lot of modules).

Interesting. That sounds like something a gentle prod to the Ubuntu
kernel team might get them to disable. Especially if there are not any
x32 binaries in sight.

Maybe Ubuntu has a reason or maybe someone just enabled the option
because it was there and they could.

>>>>> It would be much better if both checks could be done together.
>>>>> I think this would require the syscall entry code to set a
>>>>> value in both the 64bit and x32 entry paths.
>>>>> (Can a process make both 64bit and x32 system calls?)
>>>>
>>>> Yes, it bloody well can.
>>>>
>>>> And I see no benefit in pushing that logics into syscall entry,
>>>> since anything that calls in_compat_syscall() more than once
>>>> per syscall execution is doing the wrong thing. Moreover,
>>>> in quite a few cases we don't call the sucker at all, and for
>>>> all of those pushing that crap into syscall entry logics is
>>>> pure loss.
>>>
>>> The x32 system calls have their own system call table and it would be
>>> trivial to set a flag like TS_COMPAT when looking up a system call from
>>> that table. I expect such a change would be purely in the noise.
>>
>> Certainly a write of 0/1/2 into a dirtied cache line of 'current'
>> could easily cost absolutely nothing.
>> Especially if current has already been read.
>>
>> I also wondered about resetting it to zero when an x32 system call
>> exits (rather than entry to a 64bit one).
>>
>> For ia32 the flag is set (with |=) on every syscall entry.
>> Even though I'm pretty sure it can only change during exec.
>
> It can change for every syscall. I have tests that do this.
>
>>>> What's the point, really?
>>>
>>> Before we came up with the current games with __copy_siginfo_to_user
>>> and x32_copy_siginfo_to_user I was wondering if we should make such
>>> a change. The delivery of compat signal frames and core dumps which
>>> do not go through the system call entry path could almost benefit from
>>> a flag that could be set/tested when on those paths.
>>
>> For signal delivery it should (probably) depend on the system call
>> that set up the signal handler.
>
> I think it has worked this way for some time now.

It always has, but there is code that is called as part of signal delivery
that needs to know if it is ia32 or x32 code (namely
copy_siginfo_to_user32). The code paths are short enough we don't
strictly need the runtime test on that path and we have been able to
remove it, but it is an example of the kind of path that is not a
syscall entry where it would be nice to set the flag.
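
To make the idea concrete, here is a rough userspace sketch of such a
flag. It is only a model: every name in it (model_task, model_abi,
model_syscall_enter, model_signal_deliver, model_in_compat, model_in_x32)
is hypothetical and nothing like it exists in the kernel today. The
0/1/2 values follow the suggestion quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum model_abi { MODEL_ABI_64 = 0, MODEL_ABI_IA32 = 1, MODEL_ABI_X32 = 2 };

struct model_task {
	enum model_abi abi;	/* written by whichever path is about to run compat code */
};

/* Syscall entry would record the ABI of the table the call was looked up in. */
static void model_syscall_enter(struct model_task *t, enum model_abi abi)
{
	t->abi = abi;
}

/* A path that is not a syscall entry, such as signal delivery, sets the same field. */
static void model_signal_deliver(struct model_task *t, enum model_abi handler_abi)
{
	t->abi = handler_abi;
}

/* One load and one compare cover both ia32 and x32. */
static bool model_in_compat(const struct model_task *t)
{
	return t->abi != MODEL_ABI_64;
}

/* Distinguishing x32 from ia32 needs no look at the saved syscall number. */
static bool model_in_x32(const struct model_task *t)
{
	return t->abi == MODEL_ABI_X32;
}
```

In this model a helper like copy_siginfo_to_user32 would only need to
call model_in_x32() and would not care which path set the field.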

>> Although I'm sure I remember one kernel where some of it was done
>> in libc (with a single entrypoint for all hadlers).
>>
>>> The fact that only SIGCHLD (which can not trigger a coredump) is
>>> different saves the coredump code from needing such a test.
>>>
>>> The fact that the signal frame code is simple enough it can directly
>>> call x32_copy_siginfo_to_user or __copy_siginfo_to_user saves us there.
>>>
>>> So I don't think we have any cases where we actually need a flag that
>>> is independent of the system call but we have come very close.
>>
>> If a program can do both 64bit and x32 system calls you probably
>> need to generate a 64bit core dump if it has ever made a 64bit
>> system call??
>
> I think core dump should (and does) depend on the execution mode at
> the time of the crash.

The core dump code is currently tied to what binary you exec.
The code in exec sets mm->binfmt, and the coredump code uses mm->binfmt
to pick the coredump handler.
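
Roughly, and only as a userspace model of the dispatch, the shape is
the following. The model_* names and the returned labels are made up
for illustration; only the idea of a per-mm binfmt pointer set at exec
time comes from the description above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical userspace model of the dispatch; not kernel code. */
struct model_binfmt {
	const char *name;
	const char *(*core_dump)(void);	/* returns a label for the dump format */
};

static const char *model_dump_elf64(void)
{
	return "elf64";
}

static const char *model_dump_compat(void)
{
	return "compat-elf32";
}

static const struct model_binfmt model_elf_format = {
	"elf", model_dump_elf64
};

static const struct model_binfmt model_compat_elf_format = {
	"compat_elf", model_dump_compat
};

struct model_mm {
	const struct model_binfmt *binfmt;	/* fixed at exec time */
};

/* exec records which binary format loaded the image. */
static void model_exec(struct model_mm *mm, const struct model_binfmt *fmt)
{
	mm->binfmt = fmt;
}

/* The dump handler is picked from the mm, not from the CPU mode at the crash. */
static const char *model_do_coredump(const struct model_mm *mm)
{
	return mm->binfmt->core_dump();
}
```

Nothing the process does after exec changes the handler in this model,
which is the point.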

An x32 binary will make all kinds of 64bit calls where it doesn't need
the compat handling. And of course x32 binaries run in 64bit mode with
32bit pointers so looking at the current execution mode doesn't help.
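
As a sketch of why the answer is per call rather than per process, the
check described at the top of the thread can be modelled in userspace
like this. The sketch_* names are hypothetical stand-ins; the two
constants are meant to mirror TS_COMPAT and the x32 bit in the syscall
number, but treat the values as assumptions of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model constants; assumed stand-ins for the kernel's flag and bit. */
#define SKETCH_TS_COMPAT	0x0002u
#define SKETCH_X32_SYSCALL_BIT	0x40000000u

struct sketch_thread {
	uint32_t status;	/* thread status word, flag set on ia32 entry */
	uint64_t orig_ax;	/* syscall number as saved at entry */
};

/* ia32: a flag test on the thread status word. */
static bool sketch_in_ia32_syscall(const struct sketch_thread *t)
{
	return (t->status & SKETCH_TS_COMPAT) != 0;
}

/* x32: look at the saved syscall number and test one bit in it. */
static bool sketch_in_x32_syscall(const struct sketch_thread *t)
{
	return (t->orig_ax & SKETCH_X32_SYSCALL_BIT) != 0;
}

static bool sketch_in_compat_syscall(const struct sketch_thread *t)
{
	return sketch_in_ia32_syscall(t) || sketch_in_x32_syscall(t);
}
```

The same thread gives a different answer from one call to the next,
depending only on which syscall number it used.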

Further fun: compat_binfmt_elf is shared between x32 and ia32, because
except for a few stray places they do exactly the same thing.

It is lucky that, except for SIGCHLD, the signals for x32 and ia32 are
exactly the same, so the kernel can encode them exactly the same way.

> It’s worth noting that GCC’s understanding of mixed bitness is horrible.

Eric