Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
From: Rik van Riel
Date: Tue Jul 17 2018 - 18:05:20 EST
> On Jul 17, 2018, at 5:29 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> On Tue, Jul 17, 2018 at 1:16 PM, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>> Can I skip both the cr4 and ldt switches when the TLB contents
>> were invalidated and have since been reloaded?
>>
>> If the TLB contents are still valid, either because we never went
>> into lazy TLB mode, or because no invalidates happened while
>> we were lazy, we immediately return.
>>
>> The cr4 and ldt reloads only happen if the TLB was invalidated
>> while we were in lazy TLB mode.
>
> Yes, since the only events that change the LDT or the required CR4
> value unconditionally broadcast an IPI to every CPU in mm_cpumask,
> regardless of whether they're lazy. The interesting case is that you
> go lazy, you miss an invalidation IPI because you were lazy, then you
> go unlazy, notice the tlb_gen change, and flush. If this happens, you
> know that you only missed a page table update, and not an LDT or CR4
> update, because those would have sent an IPI even though you were
> lazy. So you should skip the CR4 and LDT updates.
>
> I suppose a different approach would be to fix the issue below and to
> try to track when the LDT actually needs reloading. But that latter
> part seems a bit complicated for minimal gain.
>
> (Do you believe me? If not, please argue back!)
>
I believe you :)
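
For the unlazy path, this is roughly what I have in mind for the
same-mm case in switch_mm_irqs_off() (rough sketch against this
series, untested; load_new_mm_cr3() is the existing helper that
writes CR3 with a flush):

	if (real_prev == next) {
		/*
		 * While lazy we can only have missed page table
		 * invalidations; LDT and CR4 changes IPI even lazy
		 * CPUs, so a TLB flush is all that can be needed.
		 */
		u64 next_tlb_gen = atomic64_read(&next->context.tlb_gen);

		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
		    next_tlb_gen) {
			this_cpu_write(cpu_tlbstate.ctxs[prev_asid].tlb_gen,
				       next_tlb_gen);
			load_new_mm_cr3(next->pgd, prev_asid, true);
		}

		/* No load_mm_cr4() or switch_ldt() on this path. */
		return;
	}
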
>>> Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to
>>> fix that part or should I?
>>
>> I would be happy to send in a patch for this, and one for
>> the above optimization you pointed out.
>>
>
> Yes please!
>
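
For load_mm_cr4(), I am thinking of something as simple as the
below (untested sketch; the PCE handling is meant to mirror what
is already there):

static inline void load_mm_cr4(struct mm_struct *mm)
{
	/*
	 * Kernel threads run on whatever mm was loaded before;
	 * init_mm carries no rdpmc state of its own, so leave
	 * CR4.PCE untouched.
	 */
	if (mm == &init_mm)
		return;

	if (static_key_false(&rdpmc_always_available) ||
	    atomic_read(&mm->context.perf_rdpmc_allowed))
		cr4_set_bits(X86_CR4_PCE);
	else
		cr4_clear_bits(X86_CR4_PCE);
}
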
There is a third optimization left to do. Currently we take a
reference on the mm every time we switch into lazy TLB mode, even
when switching from one kernel thread to another, or when
repeatedly switching between the same mm and kernel threads.

We could instead hold that reference (per CPU) from the time a CPU
first switches to an mm in lazy TLB mode until it switches to a
different mm. That would stop the cache line holding the mm_struct
reference count from bouncing on every lazy TLB context switch.

Does that seem like a reasonable optimization?
Am I overlooking anything?
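
Roughly like this (untested sketch; lazy_tlb_mm and the helper
name are made up):

static DEFINE_PER_CPU(struct mm_struct *, lazy_tlb_mm);

/*
 * Called with IRQs off from the context switch path. Take a
 * reference the first time this CPU starts using an mm lazily,
 * and drop the old reference only when the CPU moves on to a
 * different mm, instead of doing an mmgrab()/mmdrop() pair on
 * every lazy TLB context switch.
 */
static void lazy_tlb_ref_mm(struct mm_struct *mm)
{
	struct mm_struct *old = this_cpu_read(lazy_tlb_mm);

	if (old == mm)
		return;

	mmgrab(mm);
	this_cpu_write(lazy_tlb_mm, mm);

	if (old)
		mmdrop(old);
}
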
I'll try to get all three optimizations working, and will run them
through some testing here before posting upstream.