x86 TLB flushing: INVPCID vs. deferred CR3 write

From: Dave Hansen
Date: Tue Dec 05 2017 - 20:27:37 EST


tl;dr: With page table isolation, a kernel using INVPCID compiles
kernels 0.58% faster than one using the deferred CR3 write. That tends
to say we should leave things as-is and keep using INVPCID, but it's
far from definitive.

If folks have better ideas for a test methodology, or specific workloads
or hardware where you want to see this tested, please speak up.

Details below.

---

With page table isolation on systems with INVPCID (Haswell and newer),
we have a choice of how to flush the TLB for the user address space. We
can either use INVPCID while running in the kernel to flush individual
pages out of the user address space, or we can just flush the entire
user TLB context when we reload the page table pointer (CR3) at the
kernel->user switch (the deferred approach behind
invalidate_user_asid(), which marks the user ASID as needing a flush at
the next CR3 write).
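
For reference, the two options boil down to roughly the sketch below.
This is simplified and the names are illustrative (the real code is in
arch/x86/include/asm/tlbflush.h, and the real flush mask lives inside
cpu_tlbstate), but it shows the shape of the choice:

  #include <linux/types.h>
  #include <linux/percpu.h>
  #include <linux/bitops.h>

  /* illustrative stand-in for the flush mask in cpu_tlbstate */
  static DEFINE_PER_CPU(unsigned long, user_pcid_flush_mask);

  /* Option 1: INVPCID, from kernel context, one page at a time */
  static inline void flush_user_page_invpcid(u64 user_pcid,
                                             unsigned long addr)
  {
          struct { u64 pcid, addr; } desc = { user_pcid, addr };

          /* INVPCID type 0: flush one (PCID, address) translation */
          asm volatile("invpcid %0, %1"
                       : : "m" (desc), "r" (0UL) : "memory");
  }

  /* Option 2: defer.  Just record that the user ASID is stale; the
   * CR3 write at the next kernel->user transition leaves the NOFLUSH
   * bit clear and the whole user context gets flushed. */
  static inline void flush_user_asid_deferred(u16 asid)
  {
          __set_bit(asid, this_cpu_ptr(&user_pcid_flush_mask));
  }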

It's currently implemented with INVPCID, mostly because that was the
easiest code that I could drop into place without adding any
infrastructure. It was not a data-driven choice. Now that things have
settled down, it's time to collect some data. I rigged up a patch to
help me time the TLB flush instruction cost with tracepoints and turn
INVPCID on/off at runtime:

https://www.sr71.net/~dave/invpcid-on-off.patch
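
The runtime switch part of that is nothing exotic: just a debugfs
boolean, something along these lines (simplified; the function and
variable names here are my own illustration, see the patch for the
real thing):

  #include <linux/debugfs.h>
  #include <linux/init.h>

  extern struct dentry *arch_debugfs_dir;  /* /sys/kernel/debug/x86 */

  /* checked in the flush path to pick INVPCID vs. deferred CR3 */
  static bool use_invpcid_flush_one = true;

  static int __init invpcid_knob_init(void)
  {
          debugfs_create_bool("use_invpcid_flush_one", 0644,
                              arch_debugfs_dir,
                              &use_invpcid_flush_one);
          return 0;
  }
  late_initcall(invpcid_knob_init);

The flush path then checks the bool (plus X86_FEATURE_INVPCID) to
decide which of the two flavors above to use.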

Why does this matter? We are effectively balancing the incremental
TLB-flush-time cost of ~350 cycles per page against the potential
tens of thousands of cycles it costs to *fully* reload the TLB. I
say "potential" because the CPU is *really* good at hiding TLB fill
latencies. There may be thousands of cycles where the page walker is
busy filling entries, but that does not mean the CPU is stalled
waiting for those fills.
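
To put rough numbers on that balance: at ~350 extra cycles per page, a
64-page flush pays ~22,400 cycles up front with INVPCID. If a full TLB
reload really cost that many *stalled* cycles, the two schemes would
break even somewhere around 60 pages; to the extent the walker hides
the refill latency, the effective reload cost is lower and the
deferred flush looks better for the big flushes.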

The system here is a 4-core Skylake desktop with no hyperthreading.
The workload is a "make -j8" kernel compile. HT would probably only
serve to further mask the TLB fill latencies, so its absence is a
blessing for this measurement.

Using INVPCID compiles a kernel in 875.36s, while using the deferred
flush takes 880.45s. That's 0.58% worse with the CR3-based flush. The
difference is small, but it is consistent across runs. It's also a
workload that, on this system, is plowing through hundreds of millions
of L1 TLB misses a second, so presumably we *do* pay a real cost for
the full flush.

A few things of note:

* INVPCID is around 1.75x the cost of doing an INVLPG alone: INVLPG is
~200 cycles and INVLPG+INVPCID is ~550, so the INVPCID itself accounts
for the ~350-cycle incremental cost mentioned above. This roughly
correlates with how these show up in profiles as well.
* invalidate_user_asid() roughly doubles the number of
"dtlb_load_misses.walk_completed" events vs. INVLPG, but total
hardware page walker cycles only go up by ~60% (see the perf example
after this list). In other words, the CR3-based flushing causes more
TLB misses, but they are _relatively_ cheap misses.
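
Those miss and walker-cycle numbers can be reproduced with ordinary
perf counters, something like this (Skylake event names; other
generations spell the walker-cycles event differently, e.g. Haswell
calls it dtlb_load_misses.walk_duration):

  perf stat -a -e dtlb_load_misses.walk_completed \
            -e dtlb_load_misses.walk_pending -- make -j8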

The raw data are below.


-------------------------------------------------------------------------

I roughly tried to repeat the methodology from:

> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=WIP.x86/kpti&id=a5102476a24b

The systems are different, though, so don't read too much into the old
results versus these.

In the tables below, each line is one flush size: the number of pages
flushed, the percentage of all flushes that were that size, a running
cumulative percentage, then the average cycles per flush, cycles per
page, and the number of samples. (Sizes with very few samples appear
to be omitted from the output, which is why the cumulative column runs
slightly ahead of the first.)

> echo y > /sys/kernel/debug/x86/use_invpcid_flush_one
>
> 1: 91.70% 93.49% avg cycles: 1094 cycles/page: 1094 samples: 1238264
> 2: 1.29% 94.78% avg cycles: 1503 cycles/page: 751 samples: 17451
> 3: 0.29% 95.07% avg cycles: 2016 cycles/page: 672 samples: 3926
> 4: 1.35% 96.42% avg cycles: 2719 cycles/page: 679 samples: 18200
> 5: 0.01% 96.43% avg cycles: 3875 cycles/page: 775 samples: 159
> 9: 0.02% 96.45% avg cycles: 6329 cycles/page: 703 samples: 272
> 20: 0.01% 96.46% avg cycles: 11654 cycles/page: 582 samples: 73
> 22: 0.06% 96.52% avg cycles: 12630 cycles/page: 574 samples: 824
> 25: 0.05% 96.57% avg cycles: 13952 cycles/page: 558 samples: 676
> 26: 0.06% 96.63% avg cycles: 14492 cycles/page: 557 samples: 807
> 27: 0.81% 97.45% avg cycles: 15043 cycles/page: 557 samples: 10977
> 32: 0.01% 97.46% avg cycles: 18568 cycles/page: 580 samples: 107
> 33: 0.08% 97.53% avg cycles: 18481 cycles/page: 560 samples: 1022
> 35: 0.01% 97.55% avg cycles: 20303 cycles/page: 580 samples: 175
> 36: 0.00% 97.56% avg cycles: 20106 cycles/page: 558 samples: 67
> 37: 0.00% 97.56% avg cycles: 20630 cycles/page: 557 samples: 62
> 39: 0.00% 97.57% avg cycles: 21639 cycles/page: 554 samples: 58
> 40: 0.00% 97.57% avg cycles: 22340 cycles/page: 558 samples: 51
> 51: 0.01% 97.59% avg cycles: 28134 cycles/page: 551 samples: 71
> 55: 0.00% 97.59% avg cycles: 30257 cycles/page: 550 samples: 64
> 60: 0.01% 97.61% avg cycles: 32909 cycles/page: 548 samples: 101
> 64: 0.02% 97.63% avg cycles: 35670 cycles/page: 557 samples: 264
> 65: 0.03% 97.66% avg cycles: 35676 cycles/page: 548 samples: 395
> 129: 2.31% 99.98% avg cycles: 70550 cycles/page: 546 samples: 31205
> 133: 0.01% 99.99% avg cycles: 73157 cycles/page: 550 samples: 129

Not using INVPCID:

> echo n > /sys/kernel/debug/x86/use_invpcid_flush_one
>
> 1: 91.32% 93.25% avg cycles: 692 cycles/page: 692 samples: 1622083
> 2: 1.50% 94.75% avg cycles: 883 cycles/page: 441 samples: 26638
> 3: 0.27% 95.02% avg cycles: 1146 cycles/page: 382 samples: 4860
> 4: 1.61% 96.63% avg cycles: 1276 cycles/page: 319 samples: 28557
> 5: 0.00% 96.63% avg cycles: 2177 cycles/page: 435 samples: 68
> 7: 0.01% 96.64% avg cycles: 1888 cycles/page: 269 samples: 105
> 9: 0.02% 96.66% avg cycles: 3061 cycles/page: 340 samples: 308
> 14: 0.00% 96.66% avg cycles: 3294 cycles/page: 235 samples: 56
> 20: 0.09% 96.76% avg cycles: 4327 cycles/page: 216 samples: 1614
> 22: 0.07% 96.83% avg cycles: 4955 cycles/page: 225 samples: 1259
> 25: 0.06% 96.88% avg cycles: 5321 cycles/page: 212 samples: 1004
> 26: 0.06% 96.95% avg cycles: 5521 cycles/page: 212 samples: 1129
> 27: 0.92% 97.86% avg cycles: 5622 cycles/page: 208 samples: 16284
> 32: 0.01% 97.88% avg cycles: 7315 cycles/page: 228 samples: 111
> 33: 0.07% 97.94% avg cycles: 7069 cycles/page: 214 samples: 1217
> 35: 0.00% 97.95% avg cycles: 7923 cycles/page: 226 samples: 79
> 36: 0.00% 97.96% avg cycles: 7533 cycles/page: 209 samples: 88
> 37: 0.01% 97.96% avg cycles: 7707 cycles/page: 208 samples: 94
> 39: 0.00% 97.97% avg cycles: 8233 cycles/page: 211 samples: 78
> 40: 0.00% 97.97% avg cycles: 8547 cycles/page: 213 samples: 56
> 60: 0.01% 98.00% avg cycles: 12084 cycles/page: 201 samples: 120
> 64: 0.02% 98.02% avg cycles: 13092 cycles/page: 204 samples: 385
> 65: 0.03% 98.05% avg cycles: 13050 cycles/page: 200 samples: 525
> 129: 1.93% 99.99% avg cycles: 25663 cycles/page: 198 samples: 34209

These are timed kernel compiles. First column is the value of
/sys/kernel/debug/x86/use_invpcid_flush_one. The rest is just the
output of /usr/bin/time squashed onto one line.

> n 879.25 77.50 4:15.52 374%CPU (0avgtext+0avgdata 798736maxresident)k 0inputs+4452416outputs (1major+58366597minor)pagefaults 0swaps
> n 879.79 77.59 4:15.25 375%CPU (0avgtext+0avgdata 799840maxresident)k 0inputs+4452416outputs (1major+58381178minor)pagefaults 0swaps
> n 879.94 77.11 4:15.50 374%CPU (0avgtext+0avgdata 798604maxresident)k 0inputs+4452416outputs (1major+58370506minor)pagefaults 0swaps
> n 880.05 76.71 4:15.70 374%CPU (0avgtext+0avgdata 800664maxresident)k 0inputs+4452416outputs (1major+58373637minor)pagefaults 0swaps
> n 880.52 76.40 4:15.22 374%CPU (0avgtext+0avgdata 800408maxresident)k 0inputs+4452416outputs (1major+58382223minor)pagefaults 0swaps
> n 880.53 76.65 4:15.61 374%CPU (0avgtext+0avgdata 800580maxresident)k 0inputs+4452416outputs (1major+58378837minor)pagefaults 0swaps
> n 880.57 76.66 4:15.74 374%CPU (0avgtext+0avgdata 798960maxresident)k 0inputs+4452416outputs (1major+58377678minor)pagefaults 0swaps
> n 880.73 76.40 4:15.34 374%CPU (0avgtext+0avgdata 798716maxresident)k 0inputs+4452416outputs (1major+58367535minor)pagefaults 0swaps
> n 880.74 76.56 4:15.78 374%CPU (0avgtext+0avgdata 798808maxresident)k 0inputs+4452416outputs (1major+58382090minor)pagefaults 0swaps
> n 880.88 76.27 4:15.38 374%CPU (0avgtext+0avgdata 798896maxresident)k 0inputs+4452416outputs (1major+58381214minor)pagefaults 0swaps
> n 880.89 75.77 4:15.42 374%CPU (0avgtext+0avgdata 800036maxresident)k 0inputs+4452416outputs (1major+58378255minor)pagefaults 0swaps
> n 881.36 75.59 4:15.50 374%CPU (0avgtext+0avgdata 800516maxresident)k 0inputs+4452416outputs (1major+58357696minor)pagefaults 0swaps

> y 874.62 77.98 4:14.47 374%CPU (0avgtext+0avgdata 799620maxresident)k 0inputs+4452416outputs (1major+58400992minor)pagefaults 0swaps
> y 874.95 77.34 4:14.54 374%CPU (0avgtext+0avgdata 799524maxresident)k 0inputs+4452416outputs (1major+58371458minor)pagefaults 0swaps
> y 875.07 77.22 4:14.43 374%CPU (0avgtext+0avgdata 799996maxresident)k 0inputs+4452416outputs (1major+58369501minor)pagefaults 0swaps
> y 875.13 77.86 4:14.71 374%CPU (0avgtext+0avgdata 799880maxresident)k 0inputs+4452416outputs (1major+58383810minor)pagefaults 0swaps
> y 875.16 77.44 4:14.32 374%CPU (0avgtext+0avgdata 798704maxresident)k 0inputs+4452416outputs (1major+58393760minor)pagefaults 0swaps
> y 875.41 77.12 4:14.32 374%CPU (0avgtext+0avgdata 798628maxresident)k 0inputs+4452416outputs (1major+58370756minor)pagefaults 0swaps
> y 875.48 76.98 4:14.43 374%CPU (0avgtext+0avgdata 800140maxresident)k 0inputs+4452416outputs (1major+58381980minor)pagefaults 0swaps
> y 875.51 77.03 4:14.30 374%CPU (0avgtext+0avgdata 799720maxresident)k 0inputs+4452416outputs (1major+58379019minor)pagefaults 0swaps
> y 875.67 77.01 4:14.79 373%CPU (0avgtext+0avgdata 798508maxresident)k 0inputs+4452416outputs (1major+58377725minor)pagefaults 0swaps
> y 875.71 77.15 4:14.15 374%CPU (0avgtext+0avgdata 798728maxresident)k 0inputs+4452416outputs (1major+58395120minor)pagefaults 0swaps
> y 875.72 77.10 4:14.59 374%CPU (0avgtext+0avgdata 800744maxresident)k 0inputs+4452416outputs (1major+58377726minor)pagefaults 0swaps
> y 875.73 76.69 4:13.98 374%CPU (0avgtext+0avgdata 800120maxresident)k 0inputs+4452416outputs (1major+58383577minor)pagefaults 0swaps