Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread

From: Peter Zijlstra
Date: Tue Apr 07 2020 - 08:34:02 EST


On Mon, Apr 06, 2020 at 09:53:40PM -0700, Kyle Huey wrote:
> On Mon, Apr 6, 2020 at 9:45 PM Keno Fischer <keno@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Mon, Apr 6, 2020 at 11:58 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> > >
> > >
> > > > On Apr 6, 2020, at 6:13 PM, Keno Fischer <keno@xxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > This is a follow-up to my patch from two years ago [1].
> > >
> > > Your changelog is missing an explanation of why this is useful. Why would a user program want to change XCR0?
> >
> > Ah, sorry - I wasn't sure what the convention was around repeating the
> > applicable parts from the v1 changelog in this email.
> > Here's the description from the v1 patch:
> >
> > > The rr (http://rr-project.org/) debugger provides user space
> > > record-and-replay functionality by carefully controlling the process
> > > environment in order to ensure completely deterministic execution
> > > of recorded traces. The recently added ARCH_SET_CPUID arch_prctl
> > > allows rr to move traces across (Intel) machines, by allowing cpuid
> > > invocations to be reliably recorded and replayed. This works very
> > > well, with one catch: It is currently not possible to replay a
> > > recording from a machine supporting a smaller set of XCR0 state
> > > components on one supporting a larger set. This is because the
> > > value of XCR0 is observable in userspace (either by explicit
> > > xgetbv or by looking at the result of xsave) and since glibc
> > > does observe this value, replay divergence is almost immediate.
> > > I also suspect that people interested in process (or container)
> > > live-migration may eventually care about this if a migration happens
> > > in between a userspace xsave and a corresponding xrstor.
> > >
> > > We encounter this problem quite frequently since most of our users
> > > are using pre-Skylake systems (which thus don't support the AVX512
> > > state components), while we recently upgraded our main development
> > > machines to Skylake.
> >
> > Basically, for rr to work, we need to tightly control any user-visible
> > CPU behavior, either by putting the CPU in the right state or by
> > trapping and emulating (as we do for rdtsc, cpuid, etc). XCR0 controls
> > a bunch of user-visible CPU behavior, namely:
> > 1) The size of the xsave region if xsave is passed an all-ones mask
> >    (which is fairly common)
> > 2) The return value of xgetbv
>
> It's mentioned elsewhere, but I want to emphasize that the return
> value of xgetbv is the big one because the dynamic linker uses this.
> rr trace portability is essentially limited to machines with identical
> xcr0 values because of it.

I'm thinking just exposing that value is doable in a much less
objectionable fashion, no?
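
For reference, a rough sketch of how userspace observes the two values
discussed above: XCR0 via xgetbv, and the xsave area size via CPUID leaf
0xD. This assumes GCC or Clang on x86 with <cpuid.h>; the helper names
(read_xcr0, xsave_area_size, xcr0_has_avx512, print_xstate_info) are
made up for illustration and are not part of the patch or of rr.

```c
#include <assert.h>
#include <cpuid.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* XCR0 state-component bits, per the Intel SDM. */
#define XCR0_X87       (1ULL << 0)
#define XCR0_SSE       (1ULL << 1)
#define XCR0_AVX       (1ULL << 2)
#define XCR0_OPMASK    (1ULL << 5)
#define XCR0_ZMM_HI256 (1ULL << 6)
#define XCR0_HI16_ZMM  (1ULL << 7)
#define XCR0_AVX512    (XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM)

/* Hypothetical helper: true if all three AVX512 components are enabled. */
static int xcr0_has_avx512(uint64_t xcr0)
{
	return (xcr0 & XCR0_AVX512) == XCR0_AVX512;
}

/*
 * Hypothetical helper: read XCR0 with xgetbv (ECX = 0).
 * Returns 0 on success, -1 if the OS has not enabled XSAVE
 * (CPUID.1:ECX.OSXSAVE, bit 27), in which case xgetbv would fault.
 */
static int read_xcr0(uint64_t *out)
{
	unsigned int eax, ebx, ecx, edx;
	uint32_t lo, hi;

	assert(out != NULL);
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
		return -1;

	__asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
	*out = ((uint64_t)hi << 32) | lo;
	return 0;
}

/*
 * Hypothetical helper: size in bytes of the xsave area for the
 * components currently enabled in XCR0 (CPUID.(EAX=0xD,ECX=0):EBX).
 * Returns 0 if the leaf is not available.
 */
static unsigned int xsave_area_size(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx))
		return 0;
	return ebx;
}

/* Print what a program on this machine would observe. */
static void print_xstate_info(void)
{
	uint64_t xcr0;

	if (read_xcr0(&xcr0) != 0) {
		printf("XSAVE not enabled by the OS\n");
		return;
	}
	printf("XCR0            = %#llx\n", (unsigned long long)xcr0);
	printf("AVX512 enabled  = %d\n", xcr0_has_avx512(xcr0));
	printf("xsave area size = %u bytes\n", xsave_area_size());
}
```

Calling print_xstate_info() on a pre-Skylake box and on an AVX512 box
would be expected to show different XCR0 values and area sizes, which is
the difference a replayed program gets to see.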