Re: ptrace vs FSGSBASE

From: Andy Lutomirski
Date: Mon May 02 2016 - 11:38:43 EST

On Mon, May 2, 2016 at 7:27 AM, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> Hi Andy,
> let me first say that I never knew how this code (and the hardware)
> actually works, I am not sure I even understand what ARCH_SET_.S
> exactly does ;)
> What is even worse, I do not understand your question. So it is not
> that I am trying to help, I am asking you to help me understand the
> problem.
> On 04/29, Andy Lutomirski wrote:
>> 1. I read fs_base using ptrace. I think I should get the actual
>> fs_base without any nonsense.
> Which fs_base? The member of user_regs_struct? But this structure/layout
> is just the ABI, so to me it seems correct that getreg() tries to look
> at ->fs and/or ->fsindex.

Yeah, the member of user_regs_struct.

> IOW. getreg(fs) should return the same value as prctl(ARCH_GET_FS)
> returns if called by the tracee, no?

Hah, nice can of worms there. You're assuming that ARCH_GET_FS
actually worked...
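
For concreteness, the tracee-side query being compared against looks
roughly like the sketch below. It assumes x86-64 Linux and goes through
the raw arch_prctl syscall; read_own_fs_base is a name made up for
illustration. It only shows how the call is made and says nothing about
whether the value is trustworthy on a given kernel.

```c
#define _GNU_SOURCE
#include <unistd.h>
#if defined(__x86_64__) && defined(__linux__)
#include <sys/syscall.h>  /* SYS_arch_prctl */
#include <asm/prctl.h>    /* ARCH_GET_FS */
#endif

/*
 * Hypothetical helper (not from the thread): ask the kernel for the
 * calling thread's FS base. Returns 0 on success, -1 on failure or
 * when built for anything other than x86-64 Linux.
 */
int read_own_fs_base(unsigned long *base)
{
#if defined(__x86_64__) && defined(__linux__)
    return syscall(SYS_arch_prctl, ARCH_GET_FS, base) == 0 ? 0 : -1;
#else
    (void)base;
    return -1;
#endif
}
```

A debugger would want ptrace's fs_base to agree with whatever this
returns inside the tracee.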

>> 2. I read all the regs (PEEKUSER or whatever) and then write them all
>> back verbatim. At the very least, I think that if I do this
>> atomically using PTRACE_SETREGSET, the task's state needs to remain
>> unchanged.
> Agreed... do you mean this doesn't work?

I'm not 100% sure. It probably does right now. See below.
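
A rough way to check the round trip is sketched below. It assumes
x86-64 Linux and a kernel that allows PTRACE_TRACEME, and it uses
PTRACE_GETREGS/PTRACE_SETREGS rather than the regset interface to keep
the example short. regs_round_trip_is_noop is a hypothetical name.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#if defined(__x86_64__) && defined(__linux__)
#include <sys/ptrace.h>
#include <sys/user.h>   /* struct user_regs_struct */
#endif

/*
 * Hypothetical check: stop a child, read its registers, write them back
 * verbatim, and read them again.
 * Returns 1 if the two reads match, 0 if they differ, and -1 if the
 * experiment could not be run (no ptrace, wrong architecture, ...).
 */
int regs_round_trip_is_noop(void)
{
#if defined(__x86_64__) && defined(__linux__)
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Tracee: ask to be traced, then stop so the parent can look. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(1);
        raise(SIGSTOP);
        _exit(0);
    }

    int status = 0, result = -1;
    struct user_regs_struct before, after;

    if (waitpid(child, &status, 0) == child && WIFSTOPPED(status) &&
        ptrace(PTRACE_GETREGS, child, NULL, &before) == 0 &&
        ptrace(PTRACE_SETREGS, child, NULL, &before) == 0 &&
        ptrace(PTRACE_GETREGS, child, NULL, &after) == 0)
        result = memcmp(&before, &after, sizeof(before)) == 0;

    /* Clean up the tracee regardless of the outcome. */
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return result;
#else
    return -1;
#endif
}
```

This only compares what ptrace reports before and after. It cannot see
a change in the real FSBASE that ptrace itself fails to report.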

>> Since ptrace doesn't seem to have any real concept of
>> atomic register state changes right now
> Could you spell please?
> I can't understand what does "atomically" mean in this context.

I mean "change fs and fs_base to these two values in a single syscall
so that the kernel can do something intelligent."

Let me give some background:

On 32-bit systems, there are the FS and GS registers. For any value
of FS, there is an implied base address of the FS segment. A debugger
could, if it cared, try to figure out that implied base, except that
no one ever added the API for that. If a debugger read FS and wrote
the same value back to FS, then the process would probably end up in
the same state it started in (modulo several bugs, all but one of
which are now fixed in -tip AFAIK.) All was well.
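
To make "implied base" a bit more concrete: a selector is just an index
into a descriptor table plus two small fields, and the base comes from
whatever descriptor it points at. The sketch below only splits a
selector into its architectural fields (it does not look up any
descriptor); the struct and function names are made up for
illustration.

```c
#include <stdint.h>

/* Fields of an x86 segment selector (names here are illustrative). */
struct selector_fields {
    unsigned index;  /* bits 15..3: descriptor table index */
    unsigned table;  /* bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;    /* bits 1..0: requested privilege level */
};

struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.rpl   = sel & 0x3;
    f.table = (sel >> 2) & 0x1;
    f.index = sel >> 3;
    return f;
}
```

For the 0x2b value used later in this mail, that gives GDT index 5 with
RPL 3.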

On current 64-bit Linux systems, there is a degree of
independent control of FS and FSBASE. A process can call ARCH_SET_FS
and pass an offset >4G, which will result in FS == 0 and FSBASE ==
whatever the process passed. This is already a bit screwy. Suppose a
debugger writes zero to FS. If this were an actual MOV instruction on
an Intel chip, FSBASE would be reset to zero (and then the context
switch code would corrupt it). But writing zero to FS through ptrace
should have no effect and currently has no effect. If FS != 0, then
FSBASE has some implied value. On old kernels, reading
user_regs_struct::fs_base would give either zero or garbage, depending
on which set of bugs you managed to hit. If you write, say, 0x2b to
fs and 12345 to fs_base using the ptrace API, you'd end up with FS ==
0x2b and FSBASE == 0, because the fs_base write went to an ignored

On Ivy Bridge and up, there's a new CPU feature that lets user code
override FSBASE on its own, making pretty much any combination of FS
and FSBASE possible. But how should this interact with ptrace? If a
debugger sets fs_base = 12345 and *then* sets fs to 0x2b, does the
debugger expect the write to fs to override FSBASE (which it would if
done using MOV) causing FSBASE to reset to zero? Or should FSBASE
actually end up containing 12345? The issue comes up because, on
these newer systems, 0x2b/12345 is actually a reasonable combination
of values, whereas, on older systems, it was not.
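
The two possible answers can be written down as a toy model. This is
not kernel code: every name below is hypothetical, and the "implied
base" lookup is stubbed to return zero, which is what the 0x2b example
above assumes.

```c
/* Two candidate policies for what a ptrace write to fs should do. */
enum fs_write_policy {
    FS_WRITE_RELOADS_BASE,  /* behave like MOV: base follows the selector */
    FS_WRITE_KEEPS_BASE     /* selector and base are independent */
};

struct toy_seg_state {
    unsigned short fs;
    unsigned long  fsbase;
};

/* Stand-in for a descriptor lookup; assumed to yield 0 for this example. */
static unsigned long toy_implied_base(unsigned short sel)
{
    (void)sel;
    return 0;
}

void toy_write_fs_base(struct toy_seg_state *s, unsigned long base)
{
    s->fsbase = base;
}

void toy_write_fs(struct toy_seg_state *s, unsigned short sel,
                  enum fs_write_policy policy)
{
    s->fs = sel;
    if (policy == FS_WRITE_RELOADS_BASE)
        s->fsbase = toy_implied_base(sel);
}
```

Writing fs_base = 12345 and then fs = 0x2b leaves FSBASE at zero under
the first policy and at 12345 under the second, so the result depends
on both the policy and the order of the writes.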