Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread

From: Keno Fischer
Date: Tue Apr 07 2020 - 18:16:34 EST


> The userspace buffer is... a userspace buffer. It is not and should not
> be tied to the format of the kernel buffer.

I don't think I disagree with you. However, I should point out that in
the kernel currently, the layout of this user space buffer changes depending
on the setting of the kernel XCR0 (Is this a good idea? Probably not -
certainly all the users I've looked at get this wrong and assume the
layout is fixed just variably sized, but that is how the kernel currently
works, so we're probably stuck with it)

> > Are you referring to a ptracer which runs with a modified XCR0, and
> > assumes that the value it gets back from ptrace will have an
> > XSTATE_BV equal to its own observed XCR0 and thus get confused about
> > the layout of the buffer (or potentially have not copied all of the
> > relevant xstate because it calculated a wrong buffer size)?
>
> I don't think it's insane for a process to assume that it can XRSTOR a
> buffer that it gets back from ptrace. That seems like something that
> could clearly be an ABI that apps depend on.

Yes, that's fair, but I'm less worried about the case where
the ptracer itself is running with a modified XCR0,
because something had to explicitly opt that
process into that behavior. That piece can verify
that the application works fine, or alternatively fix
the system calls to emulate whatever behavior
that user application wants to build from this primitive.
I also believe that the issue I mentioned above,
where the ptrace xstate buffer is compacted will
cause it to fail to XRSTOR properly depending
on what the kernel XCR0 value is, and it won't XRSTORC
either, because we don't write the XCOMP_BV.

> Also, let's look at the comment about where XCR0 shows up in the ABI
> (arch/x86/include/asm/user.h):
>
> > * For now, only the first 8 bytes of the software usable bytes[464..471] will
> > * be used and will be set to OS enabled xstate mask (which is same as the
> > * 64bit mask returned by the xgetbv's xCR0).
>
> That also makes it sound like we expect there to be a *SINGLE* value
> across the entire system. It also makes me wonder *which* xgetbv is
> expected to match USER_XSTATE_XCR0_WORD. It can't be the ptracee since
> we expect them to change XCR0. It can't be the ptracer because they can
> use this new prctl too. So does it refer to the kernel? Or, should the
> new prctl() *disable* future ptrace()s?

It can't be the ptracee's XCR0, because that would be
breaking to the ptracers, which haven't opted into any
XCR0 modification. I don't think it should be the ptracer's
XCR0, because that would make the ptracer with the
modified XCR0 unable to trace a tracee with the full XCR0,
which, while potentially an acceptable trade-off, seems unnecessary.

> > If so, I think that's just a buggy ptracer. The kernel's xfeature
> > mask is available via ptrace and a well-behaved ptracer should use
> > that (e.g. gdb does, though it looks like it then also assumes that
> > the xstate has no holes, so it potentially gets the layout wrong
> > anyway).
>
> I'm trying to figure out what the semantics are of this whole thing. It
> can't be "don't let userspace observe the real XCR0" because ptrace
> exposes that. Is it, "make memory images portable, unless it's a memory
> image from ptrace"?

The semantics I want are "make userspace instructions behave
in the way that they would if XCR0 was this value". I'm open to
additionally extending this to sigframes, because I can't think of
a situation in which writing all those extra zeros would be useful,
but I'm also ok with the argument that it shouldn't affect any kernel
behavior whatsoever as Andy was suggesting earlier in the thread.

You can build something with more complete semantics on top of it,
as rr would, to more fully emulate the environment. You're probably
gonna be needing to trap CPUID anyway to mask off the relevant
features. I would prefer if the kernel didn't make assumptions here,
and just gave me the minimal primitive to build on top of.

> > In general, I don't really want the modified XCR0 to affect
> > anything other than the particular instructions that depend
> > on it and maybe the signal frame (though as I said before,
> > I'm open to either here).
>
> Just remember that, in the end, we don't get to say what a good ptracer
> or bad ptracer is. If they're expecting semantics that we've kept
> constant for 10 years, we change the semantics, and the app breaks, the
> kernel is in the wrong.

Every ptracer is a bad ptracer - using this API "correctly" is essentially
impossible ;). I certainly agree that we can't change the behavior for
ptracers that haven't opted into XCR0 modification, but I don't think
that imposes any restriction on what a ptracer with modified XCR0
does (well, there's three possible options, and two reasonable ones,
so it should be one of those)

> I also don't feel like I have a good handle on what ptracers *do* with
> their XSAVE buffers that they get/set. How many apps in a distro do
> something with this interface?

Well, anything that's a debugger would: gdb, lldb, rr.
CRIU's compel tool does too.
However, most ptracers probably don't touch the fpu state.
strace doesn't, proot doesn't.
I did some grepping around and the only other ptracer
I could find that was using this interface is
... linux (in um mode).

As a quick survey, gdb reads the kernel
XCR0 from the xstate and uses that to compute
the size of the xsave area (which I think is most correct).
lldb and rr use cpuid, which should be a
fine upper bound (overallocating is fine for ptrace)
assuming nobody is messing with cpuid *cough*,
though of course what just happens here is that they
just won't see the extra state. criu always uses 4096 bytes.
linux um requests the maximum it knows about.
None (including linux-um, oops) of them are aware
that the layout of the ptrace buffer is compressed if
XCR0 has holes in it (I'll leave it up to you to decide
whether that means we should fix it while hardware
with holes in its XCR0 is still rare - though you can
get into this situation with cmdline flags). All those
applications will have incorrect behavior if the kernel
XCR0 has a whole in it (they don't care about the
user XCR0 though, so that's somewhat unrelated
to this PR).

I don't see any compelling reason to hide additional
xstate from a ptracer with modified xcr0 if it asks for it.