x86 has no flush_tlb_range support in instruction level. Currently the
flush_tlb_range just implemented by flushing all page table. That is not
the best solution for all scenarios. In fact, if we just use 'invlpg' to
flush few lines from TLB, we can get the performance gain from later
remain TLB lines accessing.
But the 'invlpg' instruction costs much of time. Its execution time can
compete with cr3 rewriting, and even a bit more on SNB CPU.
So, on a 512 4KB TLB entries CPU, the balance points is at:
512 * 100ns(assumed TLB refill cost) =
x(TLB flush entries) * 140ns(assumed invlpg cost)
Here, x is about 360, that is about 5/8 of 512 entries.
But with the mysterious CPU pre-fetcher and page miss handler Unit, the
assumed TLB refill cost is far lower then 100ns in sequential access. And
2 HT siblings in one core makes the memory access more faster if they are
accessing the same memory. So, in the patch, I just do the change when
the target entries is less than 1/16 of whole active tlb entries.
Actually, I have no data support for the percentage '1/16', so any
suggestions are welcomed.
+
+#define FLUSHALL_BAR 16