Re: [RFC][PATCH 00/10] Use global pages with PTI

From: Dave Hansen
Date: Fri Feb 23 2018 - 20:50:03 EST


On 02/22/2018 01:52 PM, Linus Torvalds wrote:
> Side note - and this may be crazy talk - I wonder if it might make
> sense to have a mode where we allow executable read-only kernel pages
> to be marked global too (but only in the kernel mapping).

We did that accidentally, somewhere. It causes machine checks on K8s,
IIRC, which is fun (52994c256df fixed it). So, we'd need to make sure
we avoid it there, or just make it global in the user mapping too.

> Of course, maybe the performance advantage from keeping the ITLB
> entries around isn't huge, but this *may* be worth at least asking
> some Intel architects about?

I kinda doubt it's worth the trouble. Like you said, this probably
doesn't even matter when we have PCID support. Also, we'll usually map
all of this text with 2M pages, minus whatever hangs over into the last
2M page of text. My laptop looks like this:

> 0xffffffff81000000-0xffffffff81c00000 12M ro PSE x pmd
> 0xffffffff81c00000-0xffffffff81c0b000 44K ro x pte

So, even if we've flushed these entries, we can get all of them back
with a single cacheline worth of PMD entries.
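
To spell out the arithmetic behind that: a rough sketch, assuming 8-byte
PMD entries, 64-byte cachelines, and that the whole 2M-mapped range sits
under one PMD table. The helper name pmd_cachelines() is hypothetical and
only there for illustration.

```python
PMD_SHIFT = 21            # each PMD entry maps 2 MiB on x86-64
ENTRIES_PER_TABLE = 512   # 9 index bits per paging level
ENTRY_BYTES = 8           # 64-bit page table entries
CACHELINE_BYTES = 64      # assumption: 64-byte cachelines


def pmd_cachelines(start, end):
    """Return (PMD entry count, cachelines touched) for the range [start, end).

    Assumes start and end are 2 MiB aligned and that the range does not
    cross into a second PMD table.
    """
    first = (start >> PMD_SHIFT) % ENTRIES_PER_TABLE
    count = (end - start) >> PMD_SHIFT
    last = first + count - 1
    first_line = first * ENTRY_BYTES // CACHELINE_BYTES
    last_line = last * ENTRY_BYTES // CACHELINE_BYTES
    return count, last_line - first_line + 1


# The 12M PSE range from the dump above.
count, lines = pmd_cachelines(0xffffffff81000000, 0xffffffff81c00000)
print(count, lines)
```

For the range in the dump this comes out to 6 entries (48 bytes), all of
which land in the same cacheline of the PMD page.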

Just for fun, I tried a 4-core Skylake system with KPTI and nopcid and
compiled a random kernel 10 times. I did three configs: no global, all
kernel text global + cpu_entry_area, and only cpu_entry_area + entry
text. The delta percentages are from the Baseline. The deltas are
measurable, but the largest bang for our buck is obviously the entry text.

                          User Time       Kernel Time     Clock Elapsed
Baseline  (33 GLB PTEs)   907.6           81.6            264.7
Entry     (28 GLB PTEs)   910.9 (+0.4%)   84.0 (+2.9%)    265.2 (+0.2%)
No global ( 0 GLB PTEs)   914.2 (+0.7%)   89.2 (+9.3%)    267.8 (+1.2%)
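
For anyone who wants to recheck the deltas, here is a small sketch that
recomputes them from the raw times in the table. The dictionary layout
and the delta_pct() helper are illustrative assumptions, not part of any
benchmark tooling used here.

```python
# Times copied from the table above.
baseline = {"user": 907.6, "kernel": 81.6, "elapsed": 264.7}
configs = {
    "Entry": {"user": 910.9, "kernel": 84.0, "elapsed": 265.2},
    "No global": {"user": 914.2, "kernel": 89.2, "elapsed": 267.8},
}


def delta_pct(value, base):
    """Percentage change of value relative to base, rounded to one decimal."""
    return round((value - base) / base * 100, 1)


deltas = {
    name: {key: delta_pct(val, baseline[key]) for key, val in times.items()}
    for name, times in configs.items()
}

for name, d in deltas.items():
    print(f"{name:10s} user {d['user']:+.1f}%  kernel {d['kernel']:+.1f}%  "
          f"elapsed {d['elapsed']:+.1f}%")
```

Each percentage is (config - baseline) / baseline, so the kernel time
column is where the difference between configurations shows up most.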

It's a single line of code to go from the "33" to "28" configuration, so
it's totally doable. But, it means having and parsing another boot
option that confuses people and then I have to go write actual
documentation, which I detest. :)

My inclination would be to just do the "entry" stuff as global just as
this set left things and leave it at that.

I also measured frontend stalls with the toplev.py tool[1]. They show
roughly the same thing, but a bit magnified, both because I was only
monitoring the kernel and because in some of these cases, even if we
stop being iTLB-bound, we just bottleneck on something else.

I ran:

python ~/pmu-tools/toplev.py --kernel --level 3 make -j8

And looked for the relevant ITLB misses in the output:

Baseline:
> FE  Frontend_Bound:  24.33 % Slots   [ 7.68%]
> FE  ITLB_Misses:      5.16 % Clocks  [ 7.73%]
Entry:
> FE  Frontend_Bound:  26.62 % Slots   [ 7.75%]
> FE  ITLB_Misses:     12.50 % Clocks  [ 7.74%]
No global:
> FE  Frontend_Bound:  27.58 % Slots   [ 7.65%]
> FE  ITLB_Misses:     14.74 % Clocks  [ 7.71%]

[1] https://github.com/andikleen/pmu-tools