Re: [RFC PATCH v2] x86/arch_prctl: Add ARCH_SET_XCR0 to set XCR0 per-thread

From: Dave Hansen
Date: Tue Apr 07 2020 - 10:07:02 EST


On 4/7/20 5:21 AM, Peter Zijlstra wrote:
> You had a fairly long changelog detailing what the patch does; but I've
> failed to find a single word on _WHY_ we want to do any of that.

The goal in these record/replay systems is to be able to recreate the
exact same program state on two systems at two different times. To make
it reasonably fast, they try to minimize the number of snapshots they
have to take and avoid things like single stepping.

So, there are some windows where they just let the CPU run and don't
bother with taking any snapshots of register state, for instance. Let's
say you read a word from shared memory, multiply it and shift it around
some registers, then stick it back in shared memory. Most of these
things will just record a snapshot at the memory read and assume
that all the instructions in the middle execute deterministically. That
eliminates a ton of snapshots.

But, what if an instruction in the middle isn't deterministic between
two machines? Let's say you record a trace on a Broadwell system,
then try to replay it on a Skylake, and one of the non-snapshotted
instructions is xgetbv. Skylake added MPX, so xgetbv will return
different values. Your replay diverges from what was "recorded", and
life sucks.
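
To make that concrete, here is a rough sketch of what the replayed
program would see. It is not taken from the patch. The helper names
(cpu_has_osxsave, read_xcr0, xcr0_mismatch) are made up for
illustration, and the bit positions are the XCR0 state-component bits
as I understand them from the SDM (bits 3 and 4 being the two MPX
components).

```c
#include <stdint.h>

/* XCR0 state-component bits (assumed from the Intel SDM layout). */
#define XCR0_X87      (1ULL << 0)
#define XCR0_SSE      (1ULL << 1)
#define XCR0_AVX      (1ULL << 2)
#define XCR0_BNDREGS  (1ULL << 3)  /* MPX bound registers */
#define XCR0_BNDCSR   (1ULL << 4)  /* MPX config/status register */

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* CPUID.1:ECX bit 27 (OSXSAVE) tells us xgetbv is usable from userspace. */
static inline int cpu_has_osxsave(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 0;
	return (ecx >> 27) & 1;
}

/* Read XCR0 (ECX = 0). Only call this if cpu_has_osxsave() is true. */
static inline uint64_t read_xcr0(void)
{
	uint32_t lo, hi;

	__asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
	return ((uint64_t)hi << 32) | lo;
}
#endif

/* Hypothetical helper: which XCR0 bits differ between record and replay? */
static inline uint64_t xcr0_mismatch(uint64_t recorded, uint64_t replayed)
{
	return recorded ^ replayed;
}
```

As an illustration only (these are example values, not measurements):
if the recording machine had x87, SSE and AVX enabled (0x7) and the
replay machine additionally enabled both MPX components (0x1f),
xcr0_mismatch(0x7, 0x1f) comes back as 0x18, which is exactly
XCR0_BNDREGS | XCR0_BNDCSR. Any code that branches on the xgetbv
result can take a different path on replay.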

Same problem exists for CPUID, but that was hacked around in another set.

I'm also trying to think of what kinds of things CPU companies add to
their architectures that would break this stuff. I can't recall ever
having a discussion with folks at Intel where we're designing a CPU
feature and we say, "Can't do that, it would break record/replay". I
suspect there are more of these landmines around and I bet that we're
building more of them into CPUs every day.