Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

From: Keno Fischer
Date: Mon Jun 18 2018 - 10:43:43 EST


Hi Dave,

thank you for your thorough comments, replies inline:

On Mon, Jun 18, 2018 at 8:47 AM, Dave Hansen
<dave.hansen@xxxxxxxxxxxxxxx> wrote:
> On 06/16/2018 05:33 PM, Keno Fischer wrote:
>> For my use case, it would be sufficient to simply disallow
>> any value of XCR0 with "holes" in it,
> But what if the hardware you are migrating to/from *has* holes?

Yes, that would be the problem ;). I'm not aware of that being the
case for user space states on current hardware, but perhaps,
I'm just missing a case.

> There's no way this is even close to viable until it has been made to
> cope with holes.

Ok, it seems to me the first decision is whether it should be allowed
to have the compacted format (with holes) in an in-memory xstate
image. Doing so seems possible, but would require more invasive
changes to the kernel, so I'm gonna ignore it for now.

If we want to keep the in-memory xstate layout constant after boot
time, I see four ways to do that for this feature.

1) Set up XCR0 in a different place, so that the kernel xsaves/xrstors
use an XCR0 that matches the layout the kernel expects.
2) Use xsaveopt when this particular thread flag is set and have the
kernel be able to deal with non-compressed xstate images, even
if xsaves is available.
3) What's in this patch now, but fix up the xstate after saving it.
4) Catch the fault thrown by xsaves/xrestors in this situation, update
XCR0, redo the xsaves/restores, put XCR0 back and continue
execution after the faulting instruction.

Option 1) seems easiest, but it would also require adding code
somewhere on the common path, which I assume is a no-go ;).

Option 3) seems doable entirely in the slow path for this feature.
If we xsaves with the modified XCR0, we can then fix up the memory
image to match the expected layout. Similarly, we could xrestors
in the slow path (from standard layout) and rely on the
`fpregs_state_valid` mechanism to prevent the fault.

Option 4) seems also doable, but of course messing around with
traps makes things quite a bit more complicated.

> FWIW, I just don't think this is going to be viable. I have the feeling
> that there's way too much stuff that hard-codes assumptions about XCR0
> inside the kernel and out. This is just going to make it much more fragile.
>
> Folks that want this level of container migration are probably better
> off running one of the hardware-based containers and migrating _those_.
> Or, just ensuring the places to/from they want to migrate have a
> homogeneous XCR0 mix.

Fair enough :). I figured I'd throw it out there as an RFC and see if I can
get it into an acceptable shape for you to include upstream. For my,
particular use case (replaying old traces on newer hardware),
since I control the replay hardware, I'm happy if there's a patch
for me to apply locally to solve my problem. Of course there's also
the opposite use case (recording a trace on a new machine, but
wanting to replay on an older one), which would also require the
recording machine to have this patch. The reason I make this
distinction is that we don't usually control the recording machine
(since those are our users' machines), so that tends to have a
much more varied hardware/kernel mix and it's harder to get kernel
patches applied. I'm happy to keep chipping away at this for as
long as there is feedback, but I understand if it won't make it
in at the end of the process.

>> @@ -252,6 +301,8 @@ void arch_setup_new_exec(void)
>> /* If cpuid was previously disabled for this task, re-enable it. */
>> if (test_thread_flag(TIF_NOCPUID))
>> enable_cpuid();
>> + if (test_thread_flag(TIF_MASKXCR0))
>> + reset_xcr0_mask();
>> }
>
> So the mask is cleared on exec(). Does that mean that *every*
> individual process using this interface has to set up its own mask
> before anything in the C library establishes its cached value of XCR0.
> I'd want to see how that's being accomplished.

Yes, that is correct. In my rr patch this is done by injecting an
arch_prctl syscall using ptrace, at the same spot that we inject
the arch_prctl syscall for enabling cpuid faulting (i.e. the interface
that sets the `TIF_NOCPUID` flag in the check prior to this one).
Since these two features would seem to make the most sense when
used together, I figured it would be best to align semantics.
Happy to entertain alternative suggestions though.

>> +static int xstate_is_initial(unsigned long mask)
>> +{
>> + int i, j;
>> + unsigned long max_bit = __ffs(mask);
>> +
>> + for (i = 0; i < max_bit; ++i) {
>> + if (mask & (1 << i)) {
>> + char *xfeature_addr = (char *)get_xsave_addr(
>> + &current->thread.fpu.state.xsave,
>> + 1 << i);
>> + unsigned long feature_size = xfeature_size(i);
>> +
>> + for (j = 0; j < feature_size; ++j) {
>> + if (xfeature_addr[j] != 0)
>> + return 0;
>> + }
>> + }
>> + }
>> + return 1;
>> +}
>
> There is nothing architectural saying that the init state has to be 0.

True, but it was my understanding that this was true for all currently
defined state components (unless I misread 13.6 in the SDM, which
is totally possible). Since I don't currently allow setting any bits
other than the ones currently defined, I think this is ok at the moment,
but this function is certainly one of those that make this patch "RFC" ;)

>> + case ARCH_SET_XCR0: {
>
> The interface is a mit burky. The SET_XCR0 operation masks out the
> "set" value from the current value? That's a bit counterintuitive.

I apologize for the confusion. The SET_XCR0, userspace ABI takes
the value of XCR0 to set, which then gets translated into a mask
for the kernel to save (I discussed this a bit in the original PATCH -
I guess this confusion further proves that I should just stick
with one or the other).

>> + unsigned long mask = xfeatures_mask & ~arg2;
>> +
>> + if (!use_xsave())
>> + return -ENODEV;
>> +
>> + if (arg2 & ~xfeatures_mask)
>> + return -ENODEV;
>
> This is rather unfortunately comment-free. "Are you trying to clear a
> bit that was not set in the first place?"
>
> Also, shouldn't this be dealing with the new task->xcr0, *not* the
> global xfeatures_mask? What if someone calls this more than once?

The intention of this was to disallow setting any bits in xcr0, that weren't
enabled in the `xfeatures_mask` (because that means the kernel doesn't
know about them). I was fine with re-enabling xcr0 bits later, which is
why this doesn't currently look at the task's xcr0. I personally don't need
this, but it seemed like an arbitrary choice to disallow it.

>> + if (!xcr0_is_legal(arg2))
>> + return -EINVAL;
>
> FWIW, I don't really get the point of disallowing some of the values
> made illegal in there. Sure, you shoot yourself in the foot, but the
> worst you'll probably see is a general-protection-fault from the XSETBV,
> or from the first XRSTOR*. We can cope with those, and I'd rather not
> be trying to keep a list of things you're not allowed to do with XSAVE.

That's fair. I guess I am scared of creating places in the kernel that trap,
perhaps unreasonably so ;). I should point out however, that KVM already
does keep such a list for precisely the same reason, so if I refactor
this to share such code, at least there'd be only one such list, which
may be slightly better?

> I also don't see any sign of checking for supervisor features anywhere.

At least currently, it is my understanding that `xfeatures_mask` only has
user features, am I mistaken about that?

>> + /*
>> + * We require that any state components being disabled by
>> + * this prctl be currently in their initial state.
>> + */
>> + if (!xstate_is_initial(mask))
>> + return -EPERM;
>
> Aside: I would *not* refer to the "initial state", for fear that we
> could confuse it with the hardware-defined "init state". From software,
> we really have zero control over when the hardware is in its "init state".

I meant to have this refer to the initial configuration as defined in 13.6
of the SDM, i.e. what xrestor restores when the corresponding xstate_bv
field is cleared.

> But, in any case, so how is this supposed to work?
>
> // get features we are disabling into values matching the
> // hardware "init state".
> __asm__("XRSTOR %reg1,%reg2", ...);
> prctl(PRCTL_SET_XCR0, something);
>
> ?

I was primarily thinking of the ptracer use case,

ptrace(PTRACE_SETFPXREGS, <recorded regs>)
<inject arch_prctl using ptrace>

in which case there isn't a problem, because the unrecorded regs
should be in the initial state.

> That would be *really* fragile code from userspace. Adding a printk()
> between those two lines would probably break it, for instance.
>
> I'd probably just not have these checks.

Ok, that is fine with me. It also to some extent informs the possibilities
for the compacted storage question above, since not having these
checks would require us to save the current register state such that
we can re-activate it later and restore it properly.