Re: [RFC PATCH 1/2] x86/ibpb: Skip IBPB when we switch back to same user process

From: Arjan van de Ven
Date: Thu Jan 25 2018 - 09:07:14 EST


On 1/25/2018 5:50 AM, Peter Zijlstra wrote:
On Thu, Jan 25, 2018 at 05:21:30AM -0800, Arjan van de Ven wrote:

This means that 'A -> idle -> A' should never pass through switch_mm to
begin with.

Please clarify how you think it does.


the idle code does leave_mm() to avoid having to IPI CPUs in deep sleep states
for a tlb flush.

The intel_idle code does, not the idle code. This is squirreled away in
some driver :/

afaik (but haven't looked in a while) acpi drivers did too

(trust me, you really want that; sequentially IPI'ing a pile of cores in a deep sleep
state just to flush a tlb that's empty performs horribly)

Hurmph. I'd rather fix that some other way than leave_mm(), this is
piling special on special.

the problem was tricky, but of course if something better is possible, let's figure this out

the problem is that an IPI to an idle cpu is both power inefficient and slow:
exiting a deep C state can take, say, 50 to 100 usec (it varies with many things, but
for thinking about the problem in the abstract one should generally round up to nice round numbers)

if you have, say, 64 cores that had the mm at some point, but 63 are idle, the 64th
really does not want to IPI each of those 63 serially (technically this does not need
to be serial, but the IPI code is tricky and some things end up serializing this a bit)
and take the 100 usec hit 63 times. Actually, even if it's not serialized, even ONE hit of 100 usec
is unpleasant.
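
To put rough numbers on that, here is a back-of-the-envelope sketch. The 100 usec exit latency is the rounded figure used above, not a measurement, and flush_wait_usec() is a hypothetical helper made up for this illustration; it only multiplies, it does not model the real IPI path.

```python
# Toy arithmetic only: how long the flushing CPU waits for idle CPUs to
# wake up and acknowledge a TLB flush IPI.
EXIT_LATENCY_USEC = 100  # assumed deep C-state exit latency (rounded up)
IDLE_CPUS = 63           # CPUs that had the mm at some point but are idle


def flush_wait_usec(idle_cpus, exit_latency_usec, serialized):
    """Return the wait time in usec under a fully serial or fully parallel model."""
    if idle_cpus == 0:
        return 0
    if serialized:
        # Each wakeup is paid one after the other.
        return idle_cpus * exit_latency_usec
    # All wakeups overlap, so only one exit latency is paid.
    return exit_latency_usec


print(flush_wait_usec(IDLE_CPUS, EXIT_LATENCY_USEC, serialized=True))   # 6300
print(flush_wait_usec(IDLE_CPUS, EXIT_LATENCY_USEC, serialized=False))  # 100
```

So the fully serialized case is about 6.3 msec per flush, and even the best case still costs one full C-state exit.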

so a CPU that goes idle wants to "unsubscribe" itself from those IPIs as a general objective.

but not getting flush IPIs is only safe if the TLBs in the CPU hold nothing that such an IPI would
want to flush, so the TLB needs to be empty of those things.

the only way to do THAT is to switch to an mm that is safe; a leave_mm() does this, but I'm sure other
options exist.
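
A toy model of the "unsubscribe" idea, for illustration only. This is not kernel code: Mm, Cpu, enter_idle() and flush_tlb_page() are hypothetical names invented here, and enter_idle() merely stands in for what a leave_mm()-style switch to a safe mm achieves (empty the TLB of user entries, drop out of the mm's CPU mask).

```python
class Mm:
    """Hypothetical address space; cpumask is the set of CPUs to IPI on a flush."""
    def __init__(self, name):
        self.name = name
        self.cpumask = set()


class Cpu:
    def __init__(self, cpu_id, safe_mm):
        self.id = cpu_id
        self.safe_mm = safe_mm      # an mm that never needs flush IPIs
        self.mm = safe_mm
        safe_mm.cpumask.add(cpu_id)
        self.tlb = set()            # cached (mm name, page) translations
        self.ipis = 0               # flush IPIs received

    def switch_mm(self, mm):
        # Switching mm empties the TLB and moves this CPU between masks.
        self.tlb.clear()
        self.mm.cpumask.discard(self.id)
        self.mm = mm
        mm.cpumask.add(self.id)

    def touch(self, page):
        self.tlb.add((self.mm.name, page))

    def enter_idle(self):
        # Stand-in for leave_mm(): unsubscribe by switching to the safe mm.
        self.switch_mm(self.safe_mm)


def flush_tlb_page(mm, page, cpus, initiator):
    """Flush one page of mm; return how many remote CPUs had to be IPI'd."""
    sent = 0
    for cpu in cpus:
        if cpu.id not in mm.cpumask:
            continue                # unsubscribed: TLB holds nothing for mm
        cpu.tlb.discard((mm.name, page))
        if cpu is not initiator:
            cpu.ipis += 1
            sent += 1
    return sent


safe = Mm("safe")
user = Mm("A")
cpus = [Cpu(i, safe) for i in range(64)]
for cpu in cpus:
    cpu.switch_mm(user)
    cpu.touch(0x1000)
for cpu in cpus[1:]:
    cpu.enter_idle()                # 63 CPUs go idle and unsubscribe
print(flush_tlb_page(user, 0x1000, cpus, initiator=cpus[0]))  # 0
```

In this model the 63 idle CPUs are skipped and nothing stale is left behind, because their TLBs were already emptied of that mm's entries when they unsubscribed.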

note: while a CPU that is in a deeper C state will itself flush the TLB, you don't know whether you will actually
enter a state that deep at the time the OS makes its decision (if an interrupt comes in the cycle before mwait, mwait
becomes a nop, for example). In addition, once you wake up, you don't want the CPU to start filling
the TLBs with invalid data, so you can't really just set a bit and flush after leaving idle.