Re: [PATCH RFC UGLY] x86,mm,sched: make lazy TLB mode even lazier

From: Andy Lutomirski
Date: Tue Aug 30 2016 - 14:23:32 EST


On Mon, Aug 29, 2016 at 6:14 PM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
> On August 29, 2016 4:55:02 PM PDT, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>On Aug 29, 2016 7:54 AM, "Rik van Riel" <riel@xxxxxxxxxx> wrote:
>>>
>>> On Sun, 2016-08-28 at 01:11 -0700, Andy Lutomirski wrote:
>>> > On Aug 25, 2016 9:06 PM, "Rik van Riel" <riel@xxxxxxxxxx> wrote:
>>> > >
>>> > > Subject: x86,mm,sched: make lazy TLB mode even lazier
>>> > >
>>> > > Lazy TLB mode can result in an idle CPU being woken up for a TLB
>>> > > flush, when all it really needed to do was flush %cr3 before the
>>> > > next context switch.
>>> > >
>>> > > This is mostly fine on bare metal, though sub-optimal from a power
>>> > > saving point of view, and deeper C states could make TLB flushes
>>> > > take a little longer than desired.
>>> > >
>>> > > On virtual machines, the pain can be much worse, especially if a
>>> > > currently non-running VCPU is woken up for a TLB invalidation
>>> > > IPI, on a CPU that is busy running another task. It could take
>>> > > a while before that IPI is handled, leading to performance issues.
>>> > >
>>> > > This patch is still ugly, and the sched.h include needs to be
>>> > > cleaned up a lot (how would the scheduler people like to see the
>>> > > context switch blocking abstracted?)
>>> > >
>>> > > This patch deals with the issue by introducing a third tlb state,
>>> > > TLBSTATE_FLUSH, which causes %cr3 to be flushed at the next
>>> > > context switch. A CPU is transitioned from TLBSTATE_LAZY to
>>> > > TLBSTATE_FLUSH with the rq lock held, to prevent context switches.
>>> > >
>>> > > Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.
>>> > >
>>> > > This patch is totally untested, because I am at a conference right
>>> > > now, and Benjamin has the test case :)
>>> > >
>>> >
>>> > I haven't had a chance to seriously read the code yet, but what
>>> > happens when the mm is deleted outright? Or is the idea that a
>>> > reference is held until all the lazy users are gone, too?
>>>
>>> Worst case we send a TLB flush to a CPU that does
>>> not need it.
>>>
>>> As not sending an IPI will be faster than sending
>>> one, I do not think the tradeoff will be much
>>> different for a system with PCID.
>>
>>If we were fully non-lazy, we wouldn't need to send these IPIs at all,
>>right? We would just keep cr3 pointing at swapper_pg_dir when not
>>actively using the mm. The problem with doing that without PCID is
>>that cr3 writes are really slow. Or am I missing something?
>
> Writing cr3 on a PCID system doesn't (necessarily) flush the TLB
> context. The whole reason for PCIDs is to *enable* lazy TLB by not
> making it necessary to flush a TLB context during the running of
> another process. As such, this methodology should help a PCID system
> even more: we can remember if we need to flush a TLB context during
> the scheduling of said task, without needing any IPI.

What I mean, more precisely, is: when unusing an mm, if we have PCID,
we could actually switch to swapper_pg_dir without flushing the TLB.
Then, when we resume the old task, we can use the tracking (that I add
in my patches) to decide whether its stale TLB entries need flushing.

I'm not sure this would actually improve matters in any meaningful way.
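
For concreteness, the idea above could be modeled roughly like this.
This is a userspace toy, not kernel code: every name in it (mm_model,
cpu_model, tlb_gen, resume_mm and so on) is hypothetical, and the
generation counter is only an assumption about what such tracking
might look like, not a description of what any patch actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 4

/* Hypothetical per-mm state: bumped on every change that needs a flush. */
struct mm_model {
	uint64_t tlb_gen;
};

/* Hypothetical per-CPU state. */
struct cpu_model {
	const struct mm_model *loaded_mm; /* NULL models cr3 == swapper_pg_dir */
	const struct mm_model *cached_mm; /* mm whose tagged entries may remain */
	uint64_t seen_gen;                /* generation those entries match */
	unsigned int flushes;             /* flushes performed on this CPU */
};

static struct cpu_model cpus[NR_CPUS];

/*
 * Page tables of @mm changed. Only CPUs actively running the mm are
 * flushed right away; CPUs that have unused it are left alone and
 * catch up later. Returns the number of CPUs flushed immediately.
 */
static unsigned int mm_invalidate(struct mm_model *mm)
{
	unsigned int flushed = 0;

	mm->tlb_gen++;
	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
		struct cpu_model *c = &cpus[cpu];

		if (c->loaded_mm != mm)
			continue;
		c->flushes++;
		c->seen_gen = mm->tlb_gen;
		flushed++;
	}
	return flushed;
}

/* Stop using the mm: model a switch to swapper_pg_dir with no flush. */
static void unuse_mm(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	cpus[cpu].loaded_mm = NULL;
}

/*
 * Resume @mm on @cpu. Flush only if the cached entries belong to a
 * different mm or to an older generation. Returns true if it flushed.
 */
static bool resume_mm(unsigned int cpu, struct mm_model *mm)
{
	struct cpu_model *c;
	bool need_flush;

	assert(cpu < NR_CPUS);
	c = &cpus[cpu];
	need_flush = c->cached_mm != mm || c->seen_gen != mm->tlb_gen;
	if (need_flush) {
		c->flushes++;
		c->cached_mm = mm;
		c->seen_gen = mm->tlb_gen;
	}
	c->loaded_mm = mm;
	return need_flush;
}
```

In this model a CPU that has unused the mm never gets an IPI; it pays
for a flush only on resume, and only if the generation moved while it
was away. The model keeps a single cached mm per CPU, which is a
simplification of having several PCIDs live at once.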

--Andy