Re: [RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3

From: Andy Lutomirski
Date: Wed Jan 13 2016 - 18:36:12 EST


On Fri, Jan 8, 2016 at 6:20 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Jan 8, 2016 at 4:18 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>
>>> - on pcid setups, wouldn't invpcid_flush_single_context() be better?
>>
>> I played with that and it was slower. I don't pretend that makes any sense.
>
> Ugh. I guess reading and writing cr3 has been optimized.
>
>>> And yes, that means that we'd require X86_FEATURE_INVPCID in order to
>>> use X86_FEATURE_PCID, but that seems fine.
>>
>> I have an SNB "Extreme" with PCID but not INVPCID, and there could be
>> a whole generation of servers like that. I think we should fully
>> support them.
>
> Can you check the timings? IOW, is it a win on SNB?

~80ns gain on SNB. It's actually quite impressive on SNB: it knocks
the penalty for mm switches down to 20ns or so, which I find to be
fairly amazing. (This is at 3.8GHz or thereabouts.)

>
> I think originally Intel only had two actual bits of process context
> ID in the TLB, and it was meant to be used for virtualization or
> something. Together with the hashing (to make it always appear as 12
> bits to software - a nice idea but also means that the hardware ends
> up invalidating more than software really expects), it may not work
> all that well.
>
> That _could_ explain why the original patch from intel didn't work.
>
>> We might be able to get away with just disabling preemption instead of
>> IRQs, at least if mm == active_mm.
>
> I'm not convinced it is all that much faster. Of course, it's nicer on
> non-preempt, but nobody seems to run things that way.

My current testing version has three different code paths now. If
INVPCID and PCID are both available, then it uses INVPCID. If PCID is
available but INVPCID is not, it does raw_local_irqsave. If PCID is
not available, it just does the CR3 read/write.

Yeah, it's ugly, and it's a big blob of code to do something trivial,
but it seems to work and it should be the right thing to do in most
cases.

Can anyone here ask a hardware or microcode person what's going on
with CR3 writes possibly being faster than INVPCID? Is there some
trick to it?

--Andy