Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

From: Keno Fischer
Date: Mon Jun 18 2018 - 14:16:45 EST


> So, to be useful, this interface needs to be called before an
> application can run XGETBV or XSAVE for the first time and caches a
> "bad" value. I think that means that it might not be feasible to use
> outside of cases where you ptrace() something and inject things before
> it has a chance to run any real instructions.
>
> Fundamentally, I think that makes _this_ interface pretty useless in
> practice. The only practical option is to have a _future_ XCR0 value
> set by the prctl() and then have it get made active by the kernel at
> execve().

Fair enough, but I don't see this as really fundamentally different
from the cpuid masking use case, which has the same problem and
the same interface. I'm also not convinced that there is *no* use case
where one may want to turn on certain XCR0 features while the process
is running and then turn them off again. To give a concrete example in
this context, it can be useful to write a small program into the memory space
of the replayed program and use it to analyze the memory state of the
program (e.g. to checksum the memory - or in our case to perform a
GC state validation). Such implants may want to use the AVX512
registers for performance, so it would be nice if that was possible.
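
To make the implant case a bit more concrete, here is a rough sketch (not
part of the patch) of the XCR0 bookkeeping a tracer might do before
temporarily widening XCR0 for such an implant. The bit positions are the
architectural ones from the SDM; the helper names are made up for
illustration and how the resulting value would be installed is left open.

```c
#include <stdbool.h>
#include <stdint.h>

/* XCR0 state-component bits as defined by the x86 architecture. */
#define XCR0_X87        (1ULL << 0)
#define XCR0_SSE        (1ULL << 1)
#define XCR0_AVX        (1ULL << 2)
#define XCR0_OPMASK     (1ULL << 5)  /* AVX-512 k0-k7 */
#define XCR0_ZMM_HI256  (1ULL << 6)  /* upper halves of ZMM0-15 */
#define XCR0_HI16_ZMM   (1ULL << 7)  /* ZMM16-31 */

#define XCR0_AVX512_ALL (XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM)

/* Hypothetical helper: true if AVX-512 state is usable under this XCR0.
 * AVX-512 requires SSE and AVX state to be enabled as well. */
static bool xcr0_allows_avx512(uint64_t xcr0)
{
    uint64_t need = XCR0_SSE | XCR0_AVX | XCR0_AVX512_ALL;
    return (xcr0 & need) == need;
}

/* Hypothetical helper: the XCR0 value an implant would want while it runs.
 * It keeps the tracee's bits, adds the AVX-512 components, and never asks
 * for more than the hardware value allows. */
static uint64_t xcr0_for_implant(uint64_t tracee_xcr0, uint64_t hw_xcr0)
{
    return (tracee_xcr0 | XCR0_AVX512_ALL) & hw_xcr0;
}
```

The tracer would switch to xcr0_for_implant() before running the implant
and restore the tracee's masked value afterwards, so the replayed program
never observes the wider XCR0.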

> IMNHO, if you haven't guessed yet, I think this whole exercise is a dead
> end. Just boot an identical XCR0 VM on your new hardware and do replay
> there. Done.

I had a hunch ;). However, there are a couple of considerations that
make me still want this in the kernel proper:
1. The recording-side application of this feature: getting our users
to do everything in a VM in order to send us a recording is not easy.
Part of the appeal of rr over VM-based record/replay techniques
is that it "just works" on basically any Linux host.
2. Starting a VM generally requires root permissions, which may
not be available.
3. And probably the biggest from my perspective is performance. rr
needs to do a lot of twiddling with the performance counters, which
I've seen incur significant overhead in a virtualized
environment. There's of course also a per-VM resource consumption,
but presumably we could keep one VM per-XCR0 value and replay
multiple traces per VM.