Re: Rethinking sigcontext's xfeatures slightly for PKRU's benefit?
From: Andy Lutomirski
Date: Fri Jul 01 2016 - 12:31:17 EST
On Jun 30, 2016 2:25 PM, "Dave Hansen" <dave.hansen@xxxxxxxxxxxxxxx> wrote:
>
> On 06/30/2016 10:36 AM, Andy Lutomirski wrote:
> >>> We make baseline_pkru a process-wide baseline and store it in
> >>> mm->context. That way, no matter which thread gets interrupted for a
> >>> signal, they see consistent values. We only write to it when an app
> >>> _specifically_ asks for it to be updated with a special flag to
> >>> sys_pkey_set().
> >>>
> >>> When an app uses the execute-only support, we implicitly set the
> >>> read-disable bit in baseline_pkru for the execute-only pkey.
> ...
> > Looking at your git tree, which I assume is a reasonably approximation
> > of your current patches, this seems to be unimplemented. I, at least,
> > would be nervous about using PKRU for protection of critical data if
> > signal handlers are unconditionally exempt.
>
> I actually went along and implemented this using an extra 'flag' for
> pkey_get/set(). I just left it out of this stage since I'm having
> enough problems getting it in with the existing set of features. :)
>
> I'm confident we can add this later with the flags we can pass to
> pkey_get() and pkey_set().
>
> > Also, the lazily allocated no-read key for execute-only is done in the
> > name of performance, but it results in odd semantics. How much of a
> > performance win is preserving the init optimization of PKRU in
> > practice? (I.e. how much faster are XSAVE and XRSTOR?) I can't test
> > because even my Skylake laptop doesn't have PKRU.
>
> This is admittedly not the most realistic benchmark because everything
> is cache-warm, but I ran Ingo's FPU "measure.c" code on XSAVES/XRSTORS.
> This runs things in pretty tight loops where everything is cache hot.
>
> The XSAVE instructions are monsters and I'm not super-confident in my
> measurements, but I'm seeing in the neighborhood of XSAVES/XRSTORS
> getting 20-30 cycles when PKRU is in play vs. not. This is with
> completely cache-hot data, though.
That's surprisingly bad, albeit negligible in the grand scheme of
context switches.
But maybe we could optimize differently. When switching states, mask
out PKRU if it matches between the two states. This could be messy,
but, if we switch to using WRPKRU directly some day, then the init
optimization becomes moot and this optimization becomes easy.
Hmm. If we switch to WRPKRU directly and *always* mask out PKRU,
maybe a bunch of your code gets simpler because you won't have to poke
around in the saved state.
Looking forward, I think the xstate partitions into at least three kinds:
1. Pure user state (FPU, etc)
2. User-affecting XSAVES-only state (the CET mess, etc).
3. User accessible state that is needed in user mode *and* kernel
mode. PKRU is like this.
Type 3 state needs to be switched eagerly. Types 1 and 2 need to be
*saved* eagerly but not necessarily loaded eagerly. I still want to
lazy-load it some day.
Do you happen to have WRPKRU cycle counts you can share?
--Andy