Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread

From: Kyle Huey
Date: Tue Apr 07 2020 - 00:53:55 EST


On Mon, Apr 6, 2020 at 9:45 PM Keno Fischer <keno@xxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, Apr 6, 2020 at 11:58 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >
> >
> > > On Apr 6, 2020, at 6:13 PM, Keno Fischer <keno@xxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > This is a follow-up to my patch from two years ago [1].
> >
> > Your changelog is missing an explanation of why this is useful. Why would a user program want to change XCR0?
>
> Ah, sorry - I wasn't sure what the convention was around repeating the
> applicable parts from the v1 changelog in this email.
> Here's the description from the v1 patch:
>
> > The rr (http://rr-project.org/) debugger provides user space
> > record-and-replay functionality by carefully controlling the process
> > environment in order to ensure completely deterministic execution
> > of recorded traces. The recently added ARCH_SET_CPUID arch_prctl
> > allows rr to move traces across (Intel) machines, by allowing cpuid
> > invocations to be reliably recorded and replayed. This works very
> > well, with one catch: It is currently not possible to replay a
> > recording from a machine supporting a smaller set of XCR0 state
> > components on one supporting a larger set. This is because the
> > value of XCR0 is observable in userspace (either by explicit
> > xgetbv or by looking at the result of xsave) and since glibc
> > does observe this value, replay divergence is almost immediate.
> > I also suspect that people interested in process (or container)
> > live-migration may eventually care about this if a migration happens
> > in between a userspace xsave and a corresponding xrstor.
> >
> > We encounter this problem quite frequently since most of our users
> > are using pre-Skylake systems (and thus don't support the AVX512
> > state components), while we recently upgraded our main development
> > machines to Skylake.
>
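To make the ARCH_SET_CPUID mechanism mentioned above concrete, here is a
minimal sketch of the trap-and-emulate pattern it enables. This is not
rr's actual implementation; it assumes a kernel with ARCH_SET_CPUID
support and glibc's x86_64 ucontext register names, and it hands back
dummy values where a real tracer would supply recorded ones.

#define _GNU_SOURCE
#include <asm/prctl.h>          /* ARCH_SET_CPUID */
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <ucontext.h>
#include <unistd.h>

/* Emulate a faulting cpuid: fill in the registers and skip the insn. */
static void cpuid_sigsegv(int sig, siginfo_t *info, void *ctx)
{
        ucontext_t *uc = ctx;
        unsigned char *rip = (unsigned char *)uc->uc_mcontext.gregs[REG_RIP];

        if (rip[0] != 0x0f || rip[1] != 0xa2)   /* not a cpuid insn */
                _exit(1);

        /* A real tracer would supply the recorded values here. */
        uc->uc_mcontext.gregs[REG_RAX] = 0;
        uc->uc_mcontext.gregs[REG_RBX] = 0;
        uc->uc_mcontext.gregs[REG_RCX] = 0;
        uc->uc_mcontext.gregs[REG_RDX] = 0;
        uc->uc_mcontext.gregs[REG_RIP] += 2;    /* cpuid is two bytes */
}

int main(void)
{
        struct sigaction sa = { .sa_sigaction = cpuid_sigsegv,
                                .sa_flags = SA_SIGINFO };
        unsigned int eax = 0, ebx, ecx = 0, edx;

        sigaction(SIGSEGV, &sa, NULL);
        if (syscall(SYS_arch_prctl, ARCH_SET_CPUID, 0))  /* 0: cpuid faults */
                perror("ARCH_SET_CPUID");

        asm volatile("cpuid" : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
        printf("cpuid leaf 0: %x %x %x %x\n", eax, ebx, ecx, edx);
        return 0;
}

Once cpuid faults, the tracer fully controls the feature bits the tracee
sees; the patch under discussion is about getting the same degree of
control over XCR0.
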
> Basically, for rr to work, we need to tightly control any user-visible
> CPU behavior, either by putting the CPU in the right state or by trapping
> and emulating (as we do for rdtsc, cpuid, etc). XCR0 controls a bunch of
> user-visible CPU behavior, namely:
> 1) The size of the xsave region if xsave is passed an all-ones mask
> (which is fairly common)
> 2) The return value of xgetbv

It's mentioned elsewhere, but I want to emphasize that the return
value of xgetbv is the big one because the dynamic linker uses this.
rr trace portability is essentially limited to machines with identical
xcr0 values because of it.
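
To sketch what fixing that would look like on the replay side
(illustration only; ARCH_SET_XCR0 only exists in this RFC, so the
constant below is a placeholder that would have to match the value in
the patch):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_SET_XCR0
#define ARCH_SET_XCR0 0x1021    /* placeholder; use the value from the RFC patch */
#endif

int main(void)
{
        /* XCR0 captured on the recording machine: x87 | SSE | AVX. */
        unsigned long recorded_xcr0 = 0x7;

        if (syscall(SYS_arch_prctl, ARCH_SET_XCR0, recorded_xcr0)) {
                perror("ARCH_SET_XCR0");
                return 1;
        }
        /* From here on, xgetbv/xsave in this thread reflect the recorded value. */
        return 0;
}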

- Kyle

> 3) Whether instructions making use of the relevant xstate component trap
>
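To make 1) and 2) above concrete, this is roughly how user space
observes XCR0 today (just a sketch; xgetbv requires OSXSAVE, which any
xsave-capable kernel enables):

#include <stdint.h>
#include <stdio.h>

/* xgetbv with ecx = 0 returns XCR0 in edx:eax. */
static uint64_t read_xcr0(void)
{
        uint32_t lo, hi;
        asm volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
        return ((uint64_t)hi << 32) | lo;
}

/* CPUID.(EAX=0xD, ECX=0):EBX is the xsave area size for the bits set in XCR0. */
static uint32_t xsave_size(void)
{
        uint32_t eax = 0xd, ebx, ecx = 0, edx;
        asm volatile("cpuid" : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
        return ebx;
}

int main(void)
{
        printf("XCR0 = %#llx, xsave area = %u bytes\n",
               (unsigned long long)read_xcr0(), xsave_size());
        return 0;
}

Both numbers differ between, say, a pre-Skylake machine and an AVX-512
one, and glibc reads them at startup, which is why the divergence shows
up almost immediately.
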
> In the v1 review, it was suggested that user space could be adjusted to
> deal with these issues by always checking support in cpuid first (which
> is already emulatable). Unfortunately, we don't control the environment
> on the record side (rr supports recording on any Intel CPU from the past
> decade - with the exception of a few that have microarchitecture bugs
> causing problems - and kernel versions back to 3.11), so trying to patch
> user space is a no-go for us (and since rr is of course a debugging tool,
> we want to be able to help users debug if they get uses of these
> instructions wrong).
>
> Another suggestion in the v1 review was to use a VM instead, with an
> appropriate XCR0 value. That does mostly work, but has some problems:
> 1) The performance is quite a bit worse (particularly if we're already
> replaying in a virtualized environment)
> 2) We may want to simultaneously replay tasks with different XCR0 values.
> This comes into play e.g. when recording a distributed system where
> different nodes are on hosts with different hardware configurations (the
> reason you want to replay them jointly rather than node-by-node is that
> this way you can avoid recording any inter-node communication, since you
> can just recompute it from the trace).
>
> As a result, doing this with fully-featured VMs isn't an attractive
> proposition. I had looked into doing something more light-weight using
> the raw KVM API, or something analogous to what project dune did
> (http://dune.scs.stanford.edu/ - basically implementing Linux user space,
> but where the threads run in guest CPL0 rather than host CPL3).
> My conclusion was that this approach too would require significant kernel
> modification to work well (as well as having the noted performance
> problems in virtualized environments).
>
> Sorry if this is too much of an info dump, but I hope this gives some color.
>
> Keno