Re: [RFC v2 PATCH 6/7] x86/entry: get rid of ALLOC_PT_GPREGS_ON_STACK and SAVE_AND_CLEAR_REGS

From: Ingo Molnar
Date: Thu Feb 08 2018 - 04:47:25 EST



* Dominik Brodowski <linux@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Wed, Feb 07, 2018 at 12:44:41PM -0800, Linus Torvalds wrote:
> > On Wed, Feb 7, 2018 at 12:15 PM, Dominik Brodowski
> > <linux@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > Note: The testb $3, CS(%rsp) instruction in idtentry() does not need
> > > modification. Previously %rsp was manually decreased by 15*8; with
> > > this patch, %rsp is decreased by 15 pushq instructions. Moreover,
> > > error_entry did and does the exact same test (with offset=8) after
> > > the registers have been moved/pushed and cleared.
> >
> > So this has the problem that now those save/clear instructions will
> > all be done in that idtentry macro.
> >
> > So now that code will be duplicated for all the users of that macro.
> >
> > The old code did the saving in the common error_entry and
> > paranoid_entry routines, in order to be able to share all the code,
> > and making the duplicated stub functions generated by the idtentry
> > macro smaller.
> >
> > Now, admittedly the new push sequence is much smaller than the old
> > movq sequence, so the duplication doesn't hurt as much, but it's still
> > likely quite noticeable.
> >
> > So this removes lines of asm code, but it adds a lot of instructions
> > to the end result thanks to the macro, I think.
>
> Indeed, that is the case (see below). However, if we want to switch to
> PUSH instructions and do this in a routine which is call'ed and which
> ret'urns, %rsp needs to be moved around even more often than the old
> ALLOC_PT_GPREGS_ON_STACK macro did (which you wanted to get rid of,
> IIUYC). Or do I miss something?
>
> text data bss dec hex filename
> 19500 0 0 19500 4c2c arch/x86/entry/entry_64.o-orig
> 19510 0 0 19510 4c36 arch/x86/entry/entry_64.o-3_of_7
> 21105 0 0 21105 5271 arch/x86/entry/entry_64.o-5_of_7
> 24307 0 0 24307 5ef3 arch/x86/entry/entry_64.o-7_of_7

So while this shows a +~5K increase in text size, these numbers also _very_
significantly over-represent the extra footprint. The assumptions that resulted in
us compressing the IRQ entry code have changed very significantly with the new x86
IRQ allocation code we introduced in the last year:

- IRQ vectors are usually populated in tightly clustered groups.

With our new vector allocator code the typical per CPU allocation percentage on
x86 systems is ~3 device vectors and ~10 fixed vectors out of ~220 vectors -
i.e. a very low ~6% utilization (!). This means that the _real_ text footprint
increase is probably closer to 0.1K of text...

This is a very typical picture on an average desktop system:

/sys/kernel/debug/irq/domains/VECTORS:

Online bitmaps: 40
Global available: 8007
Global reserved: 24
Total allocated: 73
System: 42: 0-19,32,50,128,237-255

| CPU | avl | man | act | vectors
0 199 0 3 33-34,48
1 199 0 3 33-35
2 199 0 3 33-35
3 199 0 3 33-35
4 199 0 3 33-35
5 199 0 3 33-35
6 199 0 3 33-35
7 199 0 3 33-35
8 199 0 3 33-35
9 199 0 3 33-35
10 201 0 1 33
11 201 0 1 33
12 201 0 1 33
13 201 0 1 33
14 201 0 1 33
15 201 0 1 33
16 201 0 1 33
17 201 0 1 33
18 201 0 1 33
19 201 0 1 33
20 199 0 3 33-35
21 199 0 3 33-35
22 199 0 3 33-35
23 199 0 3 33-35
24 199 0 3 33-35
25 199 0 3 33-35
26 199 0 3 33-35
27 199 0 3 33-35
28 199 0 3 33-35
29 199 0 3 33-35
30 201 0 1 33
31 201 0 1 33
32 201 0 1 33
33 202 0 0
34 202 0 0
35 202 0 0
36 202 0 0
37 202 0 0
38 202 0 0
39 202 0 0

I.e. the average number of (device) IRQ vectors per CPU is around 2, with the
max being 3.

The days where we allocated a lot of vectors on every CPU and the compression
of the IRQ entry code text mattered are over.

- Another issue is that only a small minority of vectors is frequent enough to
actually matter to cache utilization in practice: 3-4 key IPIs and 1-2 device
IRQs at most - and those vectors tend to be tightly clustered as well into about
two groups, and are probably already on 2-3 cache lines in practice.

For the common case of 'cache cold' IRQs it's the depth of the call chain and
the fragmentation of the resulting I$ that should be the main performance limit
- not the overall size of it.

- The CPU side cost of IRQ delivery is still very expensive even in the best, most
cached case, as in 'over a thousand cycles'. So much stuff is done that maybe
contemporary x86 IRQ entry microcode already prefetches the IDT entry and its
expected call target address.

So for those reasons I'm really tempted by the all around simplification offered
by this series:

2 files changed, 62 insertions(+), 160 deletions(-)

Basically in this specific case I'd like to turn the argument around,
use this simpler design and put the burden of proof on the patches that
want to _complicate it_ beyond this simple, straightforward model: show us
the numbers that it truly makes a difference: it's not that hard to
spray a CPU with IRQs by using IPIs, and the cache miss rate and the
overall perf counter stats should tell a clear story about whether it
makes sense to cache-compress (and complicate) the IRQ entry sequences.

Thanks,

Ingo