Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode

From: Dmitry Safonov
Date: Thu Apr 07 2016 - 11:19:31 EST


On 04/07/2016 05:39 PM, Andy Lutomirski wrote:
On Apr 7, 2016 5:12 AM, "Dmitry Safonov" <dsafonov@xxxxxxxxxxxxx> wrote:
On 04/06/2016 09:04 PM, Andy Lutomirski wrote:
[cc Dave Hansen for MPX]

On Apr 6, 2016 9:30 AM, "Dmitry Safonov" <dsafonov@xxxxxxxxxxxxx> wrote:
Now each process that runs natively on x86_64 may execute 32-bit code
by properly setting its CS selector: either from the LDT or by reusing
Linux's USER32_CS. The reverse is also valid: running 64-bit code in a
compatible task is also possible by choosing USER_CS.
So we may switch between 32 and 64 bit code execution in any process.
Linux will choose the right syscall numbers in entries for those
processes. But it will still consider them native/compat according to
the personality that the ELF loader set at launch. This affects, e.g.,
the ptrace syscall on those tasks: PTRACE_GETREGSET will return a
64/32-bit regset according to the process's mode (that's how strace
has detected a task's personality since version 4.8).
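
To make the PTRACE_GETREGSET point concrete, here is a minimal sketch of
how a tracer could classify a tracee from the iov_len the kernel writes
back. The struct and function names are made up for illustration, and the
two layouts are assumptions that only mirror the sizes of the i386 and
x86_64 user_regs_struct (17 32-bit slots and 27 64-bit slots).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layouts: only the sizes matter here, not the field names. */
struct regs_i386   { uint32_t slot[17]; };  /* 68 bytes  */
struct regs_x86_64 { uint64_t slot[27]; };  /* 216 bytes */

enum task_mode { MODE_UNKNOWN = 0, MODE_COMPAT32, MODE_NATIVE64 };

/* Classify a tracee by the iov_len that PTRACE_GETREGSET wrote back. */
static enum task_mode classify_regset_len(size_t iov_len)
{
    if (iov_len == sizeof(struct regs_i386))
        return MODE_COMPAT32;
    if (iov_len == sizeof(struct regs_x86_64))
        return MODE_NATIVE64;
    return MODE_UNKNOWN;
}
```

Note that this only reports what the kernel believes about the task, which
is exactly why a stale personality misleads the tracer.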

This patch adds arch_prctl calls for x86 that make it possible to tell
the Linux kernel which mode the application is currently running in.
This is mainly needed for CRIU: restoring both compatible and native
applications from a 64-bit restorer. For that reason I wrapped all
the code in CONFIG_CHECKPOINT_RESTORE.
This patch also solves a problem with running 64-bit code in a 32-bit
ELF (and the reverse): you only have the 32-bit ELF's vdso for fast
syscalls. When switching between native <-> compat mode via arch_prctl,
it will remap the vdso binary blob needed for the target mode.
General comments first:
Thanks for your comments.
You forgot about x32.
Will add x32 support for v2.

I think that you should separate vdso remapping from "personality".
vdso remapping should be available even on native 32-bit builds, which
means that either you can't use arch_prctl for it or you'll have to
wire up arch_prctl as a 32-bit syscall.
I can't say I got your point. By vdso remapping, do you mean
mremap for the vdso/vvar pages? I think it should work now.
For 32-bit, the vdso *must* exist in memory at the address that the
kernel thinks it's at. Even if you had a pure 32-bit restore stub,
you would still need vdso remap, because there's a chance the vdso
could land at an unusable address, say one page off from where you
want it. You couldn't map a wrapper because there wouldn't be any
space for it without moving the real vdso out of the way.

Remember, you *cannot* mremap() the 32-bit vdso because you will
crash. It works by luck for 64-bit, but it's plausible that we'd want
to change that some day. (I have awful patches that speed a bunch of
things up at the cost of a vdso trampoline for 64-bit code and a bunch
of other hacks. Those patches will never go in for real, but
something else might want the ability to use 64-bit vdso trampolines.)
Thanks for the elaboration, now I see. Signals and fast syscalls
expect mm->context.vdso to be correct.
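
For reference, the userspace side of this: a process learns where its
vdso is from the AT_SYSINFO_EHDR entry of the auxiliary vector, which is
filled in at exec time. A minimal sketch of reading it is below; the
helper names are made up for illustration, and it says nothing about what
the kernel has recorded in mm->context.vdso.

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Returns nonzero if p points at an ELF header (checks the magic only). */
static int looks_like_elf(const void *p)
{
    return p != NULL && memcmp(p, ELFMAG, SELFMAG) == 0;
}

/* Base address of the vdso as advertised in the auxiliary vector,
 * or 0 if the kernel did not provide one. */
static unsigned long current_vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}
```

A restorer that moves the vdso has to keep both views consistent: the
address userspace was told about and the one the kernel uses.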

I did the vdso remapping because the blob for a native x86_64 task
differs from the one for a compatible task. So it's just changing
blobs; the address value is there for convenience. I may omit it and
just remap a different vdso blob at the same place where the previous
vdso was. I'm not sure why we need the possibility to map a 64-bit
vdso blob on native 32-bit builds?
That would fail, but I think the API should exist. But a native
32-bit program should be able to remap the 32-bit vdso.

IOW, I think you should be able to do, roughly:

map_new_vdso(VDSO_32BIT, addr);

on any kernel.

Am I making sense?
Yes. I will rework it for some API.

For "personality", someone needs to enumerate all of the various things
that try to track bitness and see how many of them even make sense.
On brief inspection:

- TIF_IA32: affects signal format and does something to ptrace. I
suspect that whatever it does to ptrace is nonsensical, and I don't
know whether we're stuck with it.

- TIF_ADDR32 affects TASK_SIZE and mmap behavior (and the latter
isn't even done in a sensible way).

- is_64bit_mm affects MPX and uprobes.

On even more brief inspection:

- uprobes using is_64bit_mm is buggy.

- I doubt that having TASK_SIZE vary serves any purpose. Does anyone
know why TASK_SIZE is different for different tasks? It would save
code size and speed things up if TASK_SIZE were always TASK_SIZE_MAX.

- Using TIF_IA32 for signal processing is IMO suboptimal. Instead,
we should record which syscall installed the signal handler and use
the corresponding frame format.
Oh, I like it, will do.

- Using TIF_IA32 of the *target* for ptrace is nonsense. Having
strace figure out syscall type using that is actively buggy, and I ran
into that bug a few days ago and cursed at it. strace should inspect
TS_COMPAT (I don't know how, but that's what should happen). We may
be stuck with this for ABI reasons.
ptrace may check seg_32bit for the code selector; what do you think?
Not sure. I have never fully wrapped my head around ptrace.
Hm, after some thinking, I guess it's better to check TS_COMPAT:
it's set on compatible syscall entry, so there is no need to
check seg_32bit anyway.
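
Going back to the signal-frame point above (record which syscall
installed the handler and use the corresponding frame format), a rough
userspace model of the idea could look like the sketch below. Everything
in it is hypothetical: the table, the enum and the function names are
only there to illustrate the bookkeeping, not how the kernel would
actually store it.

```c
#define NSIG_MODEL 64  /* hypothetical table size for this model */

enum frame_abi { FRAME_NONE = 0, FRAME_IA32, FRAME_X32, FRAME_64 };

/* Hypothetical per-task handler table: one recorded ABI per signal. */
struct sig_table {
    enum frame_abi abi[NSIG_MODEL + 1];
};

/* Called on the sigaction path; 'entry' is the ABI of the syscall
 * that installed the handler. Returns 0 on success, -1 on bad signal. */
static int record_handler(struct sig_table *t, int sig, enum frame_abi entry)
{
    if (sig < 1 || sig > NSIG_MODEL)
        return -1;
    t->abi[sig] = entry;
    return 0;
}

/* At delivery, the frame format comes from the recorded ABI rather
 * than from a per-task flag such as TIF_IA32. */
static enum frame_abi frame_for(const struct sig_table *t, int sig)
{
    if (sig < 1 || sig > NSIG_MODEL)
        return FRAME_NONE;
    return t->abi[sig];
}
```

With this, a task that installs one handler through the 32-bit entry and
another through the 64-bit entry gets a matching frame for each,
whatever mode it happens to be running in when the signal arrives.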

Huge thanks, will work on v2 according to your comments.