Re: [PATCH 0/3] TLB flush multiple pages per IPI v5

From: Ingo Molnar
Date: Mon Jun 08 2015 - 15:52:51 EST

Next message: Guenter Roeck: "Re: randconfig build error with next-20150529, in crypto/jitterentropy.c"
Previous message: David Rientjes: "Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured"
In reply to: Dave Hansen: "Re: [PATCH 0/3] TLB flush multiple pages per IPI v5"
Next in thread: Ingo Molnar: "Re: [PATCH 0/3] TLB flush multiple pages per IPI v5"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

* Dave Hansen <dave.hansen@xxxxxxxxx> wrote:

> On 06/08/2015 10:45 AM, Ingo Molnar wrote:
> > As per my measurements the __flush_tlb_single() primitive (which you use in patch
> > #2) is very expensive on most Intel and AMD CPUs. It barely makes sense for 2
> > pages and gets exponentially worse. It's probably done in microcode and its
> > performance is horrible.
>
> I discussed this a bit in commit a5102476a2. I'd be curious what
> numbers you came up with.

... which for those of us who don't have sha1s cached in our brains is:

a5102476a24b ("x86/mm: Set TLB flush tunable to sane value (33)")

;-)

So what I measured agrees generally with the comment you added in the commit:

+ * Each single flush is about 100 ns, so this caps the maximum overhead at
+ * _about_ 3,000 ns.

Let that sink in: 3,000 nsecs = 3 usecs, that's like an eternity!

A CR3 driven TLB flush takes less time than a single INVLPG (!):

[ 0.389028] x86/fpu: Cost of: __flush_tlb() fn : 96 cycles
[ 0.405885] x86/fpu: Cost of: __flush_tlb_one() fn : 260 cycles
[ 0.414302] x86/fpu: Cost of: __flush_tlb_range() fn : 404 cycles

It's true that a full flush has hidden costs not measured above, because it has
knock-on effects (it drops all non-global TLB entries, which then have to be
refilled), but it's not _that_ bad due to:

- there almost always being an L1 or L2 cache miss when a TLB miss occurs,
  whose latency can be overlapped with the TLB refill

- the global bit being set for kernel entries, so those survive the flush

- user-space under high memory pressure typically thrashing through the TLBs
  anyway

... and especially with caches and Intel's historically phenomenally low TLB
refill latency it's difficult to measure the effects of local TLB refills, let
alone see them in any macro benchmark.
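
To put rough numbers on the direct costs discussed above, here is a minimal
user-space sketch in C. The two cycle counts are copied from the boot-time
measurements quoted earlier; the helper names and the refill_penalty_cycles
parameter are illustrative assumptions and not kernel code. The refill penalty
is exactly the hidden cost that is hard to measure, so the caller has to supply
an estimate for it.

```c
#include <assert.h>
#include <stdint.h>

/* Cycle counts copied from the measurements quoted in this mail. */
#define FULL_FLUSH_CYCLES   96u   /* __flush_tlb(): CR3-driven flush */
#define SINGLE_FLUSH_CYCLES 260u  /* __flush_tlb_one(): one INVLPG */

/* Hypothetical helper: direct cost of flushing nr_pages with one INVLPG each. */
static uint64_t invlpg_cost_cycles(unsigned int nr_pages)
{
	assert(nr_pages > 0);
	return (uint64_t)nr_pages * SINGLE_FLUSH_CYCLES;
}

/*
 * Hypothetical helper: returns 1 if a full flush is cheaper than nr_pages
 * single-page flushes. refill_penalty_cycles is the caller's estimate of the
 * hidden cost of refilling the dropped non-global entries (an assumption,
 * not a measured value).
 */
static int full_flush_is_cheaper(unsigned int nr_pages,
				 uint64_t refill_penalty_cycles)
{
	return FULL_FLUSH_CYCLES + refill_penalty_cycles <
	       invlpg_cost_cycles(nr_pages);
}
```

With these numbers, 33 single-page flushes cost 33 * 260 = 8,580 cycles of
direct work, so under this simple model the full flush wins unless the refill
penalty is estimated at more than roughly 8,500 cycles.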

Cross-CPU flushes are expensive, absolutely no argument about that - my suggestion
here is to keep the batching but simplify it, because I strongly suspect that the
biggest win is the batching, not the pfn queueing.

We might even win a bit more performance due to the simplification.
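
A rough user-space model of what "batching without pfn queueing" could look
like is sketched below. This is only an illustration of the idea, not the code
from the patch series: struct flush_batch, batch_add() and batch_flush() are
hypothetical names, CPUs are modeled as bits in a 64-bit mask, and the actual
IPI plus full flush on each target CPU is reduced to counting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical batch: remembers which CPUs need a flush, not which pfns. */
struct flush_batch {
	uint64_t cpumask;      /* bit n set: CPU n must flush its TLB */
	unsigned int nr_pages; /* pages unmapped since the last flush */
};

/* Record one unmapped page; mm_cpumask is the set of CPUs that may cache it. */
static void batch_add(struct flush_batch *b, uint64_t mm_cpumask)
{
	assert(b != NULL);
	b->cpumask |= mm_cpumask;
	b->nr_pages++;
}

/*
 * Flush the batch: one IPI per CPU in the mask, each doing a single full
 * flush regardless of how many pages were queued. Returns the number of
 * IPIs that would be sent and resets the batch.
 */
static unsigned int batch_flush(struct flush_batch *b)
{
	unsigned int ipis = 0;
	uint64_t mask;

	assert(b != NULL);
	mask = b->cpumask;
	while (mask) {
		mask &= mask - 1; /* clear the lowest set bit */
		ipis++;
	}
	b->cpumask = 0;
	b->nr_pages = 0;
	return ipis;
}
```

In this model the number of IPIs depends only on how many distinct CPUs were
touched, not on how many pages were unmapped, which is the part of the
batching that is suspected above to carry most of the win.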

> But, don't we have to take in to account the cost of refilling the TLB in
> addition to the cost of emptying it? The TLB size is historically increasing on
> a per-core basis, so isn't this refill cost only going to get worse?

Only if TLB refill latency sucks - but Intel's is very good and AMD's is pretty
good as well.

Also, usually if you miss the TLB you miss the cache line as well (you definitely
miss the L1 cache, and TLBs are sized to cover a fair chunk of your L2 cache),
and the CPU can overlap the two latencies.
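
The overlap argument can be written down as a toy latency model in C. This is
an idealized sketch: it assumes the TLB refill and the cache miss overlap
completely, which real hardware only approximates, and the function names are
made up for illustration. The caller supplies the latencies; no measured
values are implied.

```c
#include <stdint.h>

/* Cost of a miss if the TLB refill and the cache miss were fully serialized. */
static uint64_t miss_cost_serial(uint64_t tlb_refill, uint64_t cache_miss)
{
	return tlb_refill + cache_miss;
}

/* Cost of a miss if the two latencies overlap completely (idealized). */
static uint64_t miss_cost_overlapped(uint64_t tlb_refill, uint64_t cache_miss)
{
	return tlb_refill > cache_miss ? tlb_refill : cache_miss;
}

/* Latency hidden by the overlap: the shorter of the two latencies. */
static uint64_t miss_cost_hidden(uint64_t tlb_refill, uint64_t cache_miss)
{
	return miss_cost_serial(tlb_refill, cache_miss) -
	       miss_cost_overlapped(tlb_refill, cache_miss);
}
```

Under this model, whenever the cache miss is the longer of the two latencies,
the TLB refill adds nothing to the total, which is why the refill cost after a
full flush can be hard to see in practice.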

So while it might sound counter-intuitive, a full TLB flush might be faster than
trying to do software based TLB cache management ...

INVLPG really sucks. I can be convinced by numbers, but this isn't nearly as
clear-cut as it might look.

Thanks,

Ingo
