Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use

From: Dave Hansen
Date: Wed Oct 14 2020 - 02:07:29 EST


On 10/13/20 6:11 PM, Andy Lutomirski wrote:
> I have no problem with vmalloc(), but I do have a problem with
> vfree() due to the IPIs that result. We need a cache or something.

This sounds like the kind of thing we should just build into vmalloc()
instead of having a bunch of callers implement their own caches. It
shouldn't be too much of a challenge to have vmalloc() keep a cacheline
or two of stats about common allocation sizes and maintain some pools
for them.
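
Hand-waving a bit, a front-end version could be as dumb as this
(totally untested; all the names here are made up, and it assumes one
fixed allocation size so that cached buffers are interchangeable):

	#include <linux/vmalloc.h>
	#include <linux/spinlock.h>

	#define XBUF_SIZE	(64 * 1024)
	#define XBUF_CACHE_MAX	4

	static void *xbuf_cache[XBUF_CACHE_MAX];
	static int xbuf_cache_nr;
	static DEFINE_SPINLOCK(xbuf_cache_lock);

	static void *xbuf_alloc(void)
	{
		void *buf = NULL;

		spin_lock(&xbuf_cache_lock);
		if (xbuf_cache_nr)
			buf = xbuf_cache[--xbuf_cache_nr];
		spin_unlock(&xbuf_cache_lock);

		return buf ? buf : vmalloc(XBUF_SIZE);
	}

	static void xbuf_free(void *buf)
	{
		spin_lock(&xbuf_cache_lock);
		if (xbuf_cache_nr < XBUF_CACHE_MAX) {
			xbuf_cache[xbuf_cache_nr++] = buf;
			buf = NULL;
		}
		spin_unlock(&xbuf_cache_lock);

		/* Cache full: eat the vfree() and its TLB-flush IPIs. */
		if (buf)
			vfree(buf);
	}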

It's not going to be hard to implement caches to reduce vfree()-induced
churn, but I'm having a hard time imagining that it'll have anywhere near
the benefits that it did for stacks. Tasks fundamentally come and go a
*lot*, and those paths are hot.

Tasks that go to the trouble to populate 8k or 64k of register state
fundamentally *can't* come and go frequently. We also (probably) don't
have to worry about AMX tasks doing fork()/exec() pairs and putting
pressure on vmalloc(). Before an app can call out to library functions
to do the fork, it has to save the tile state off itself, which likely
puts AMX back in the init state. The fork() code can then tell that AMX
is in the init state and decline to allocate actual space for the child.
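
The check itself could be as cheap as something like this (untested;
XGETBV with ECX=1 needs the XGETBV1 CPUID feature and only tells you
anything while the parent's registers are still live, and
xstate_alloc_dynamic() is a stand-in for whatever the real allocation
path ends up being):

	#include <linux/bits.h>
	#include <linux/sched.h>

	#define XFEATURE_MASK_XTILE_DATA	BIT_ULL(18)	/* AMX TILEDATA */

	/* XGETBV, ECX=1: bitmap of features currently in non-init state */
	static inline u64 xinuse(void)
	{
		u32 eax, edx;

		asm volatile("xgetbv" : "=a" (eax), "=d" (edx) : "c" (1));
		return eax | ((u64)edx << 32);
	}

	static int fpu_clone_xstate(struct task_struct *child)
	{
		/* AMX in init state: child keeps the static buffer only */
		if (!(xinuse() & XFEATURE_MASK_XTILE_DATA))
			return 0;

		return xstate_alloc_dynamic(child);	/* stand-in name */
	}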

> I have to say: this mechanism is awful. Can we get away with skipping
> the dynamic XSAVES mess entirely? What if we instead allocate
> however much space we need as an array of pages and have one percpu
> contiguous region. To save, we XSAVE(S or C) just the AMX state to
> the percpu area and then copy it. To restore, we do the inverse. Or
> would this kill the modified optimization and thus be horrible?

Actually, I think the modified optimization would survive such a scheme:

* copy page array into percpu area
* XRSTORS from percpu area; modified optimization tuple is saved
* run userspace
* XSAVES back to percpu area; tuple matches, modified optimization
  is still in play
* copy percpu area back to page array

Since both halves of the XRSTORS->XSAVES pair are done to the percpu
area, the XSAVE tracking hardware never knows it isn't working on the
"canonical" buffer (the page array).
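
In fake code (untested; the bounce buffer sizing and the copy helpers
are stand-ins, and preemption has to stay off across each sequence,
kmap_atomic()-style):

	#include <linux/mm.h>
	#include <linux/percpu-defs.h>
	#include <asm/fpu/types.h>

	/* Needs room for the largest possible xstate; 16k is a placeholder */
	struct xsave_bounce {
		struct xregs_state xsave;
		u8 extended[16 * 1024];
	} __aligned(64);

	static DEFINE_PER_CPU(struct xsave_bounce, xsave_bounce);

	static void xrstors_from_pages(struct page **pages, int nr)
	{
		struct xregs_state *buf = &this_cpu_ptr(&xsave_bounce)->xsave;

		copy_pages_to_buf(buf, pages, nr);	/* stand-in memcpy loop */

		/* Hardware records (CPU, buf) as the modified-opt tuple. */
		asm volatile("xrstors64 %0"
			     : : "m" (*buf), "a" (-1), "d" (-1) : "memory");
	}

	static void xsaves_to_pages(struct page **pages, int nr)
	{
		struct xregs_state *buf = &this_cpu_ptr(&xsave_bounce)->xsave;

		/*
		 * Same linear address as the XRSTORS: the tuple matches,
		 * so unmodified state does not get rewritten.
		 */
		asm volatile("xsaves64 %0"
			     : "+m" (*buf) : "a" (-1), "d" (-1) : "memory");

		copy_buf_to_pages(pages, buf, nr);	/* stand-in memcpy loop */
	}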

It seems a bit ugly, but it's not like an 8k memcpy() is *that* painful.
The double dcache footprint isn't super nice, though.

Chang, I think that leaves us with three possibilities:

1. Use plain old vmalloc()
2. Use vmalloc(), but implement caches for large XSAVE state, either
   in front of, or inside the vmalloc() API
3. Use a static per-cpu buffer for XSAVE*/XRSTOR* and memcpy() to/from
   a scattered list of pages

A #4 also comes to mind:

Do something somewhat like kmap_atomic(). Map the scattered pages
contiguously and use XSAVE*/XRSTOR* directly, then unmap immediately
afterwards. For ~12k of state, I suspect the ~400-500 cycles for 3
INVLPGs might beat out a 12k memcpy().

Global TLB invalidations would not be required.
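
Roughly (hand-waving: xsave_fixmap_base()/xsave_fixmap_pte() would be
new helpers for a reserved per-cpu virtual range big enough for the
largest xstate, and this runs with preemption off, kmap_atomic()-style):

	#include <linux/mm.h>
	#include <asm/pgtable.h>
	#include <asm/tlbflush.h>

	static void *xsave_map(struct page **pages, int nr)
	{
		unsigned long va = xsave_fixmap_base(smp_processor_id());
		int i;

		for (i = 0; i < nr; i++)
			set_pte(xsave_fixmap_pte(va + i * PAGE_SIZE),
				mk_pte(pages[i], PAGE_KERNEL));

		/* Fresh mapping: nothing stale to flush on the way in. */
		return (void *)va;
	}

	static void xsave_unmap(void *addr, int nr)
	{
		unsigned long va = (unsigned long)addr;
		int i;

		for (i = 0; i < nr; i++) {
			pte_clear(&init_mm, va + i * PAGE_SIZE,
				  xsave_fixmap_pte(va + i * PAGE_SIZE));
			/* Local INVLPG only; no other CPU ever saw this. */
			flush_tlb_one_kernel(va + i * PAGE_SIZE);
		}
	}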

I have to say, though, that I'd be OK with sticking with plain old
vmalloc() for now.