Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter

From: Linus Torvalds
Date: Fri Apr 03 2015 - 14:09:07 EST

Next message: Olof Johansson: "Re: [PATCH] arm64: dts: Add Qualcomm APQ8016 SBC evaluation board dts"
Previous message: Olof Johansson: "Re: [GIT PULL] at91: dt for 4.1 #2"
In reply to: Denys Vlasenko: "Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter"
Next in thread: H. Peter Anvin: "Re: [PATCH] x86/asm/entry/64: pack interrupt dispatch table tighter"

On Fri, Apr 3, 2015 at 9:54 AM, Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
>
> How about this version?
> It still isn't a star of readability,
> but the structure of the 32-byte code block is more visible now...

Do we really even want to be this clever in the first place?

The thing is, when we take an interrupt:

(a) the L1 I$ is always cold

(b) the instruction decoder has never had time to run ahead

(c) there are usually not that many different interrupts anyway, even
under load (ie you'd have maybe disk and networking)

(d) we intentionally spread out the different interrupt vector numbers

(e) the 32-byte block thing is questionable, most older
micro-architectures fetch in 16-byte blocks iirc.

So what this tells me is that:

- (a+b) the jump-to-jump is likely fairly expensive, because even
though they are in the same cacheline, the front end hasn't gotten
ahead of anything, so there's no hiding any front end pipeline
hiccups.

- (c+d) there is likely very little advantage to trying to "pack"
things in cachelines

- (d+e) the 7-instructions-in-one-32-byte-block doesn't really sound
like all that big of a win, and it does cause a 16-byte split for some
interrupts.

In other words, I'd suggest that we just use a simple unconditional
5-byte branch instead. Add the two-byte "push" instruction, you have 7
bytes per interrupt. Align that 7 bytes up to 8, and none of them ever
cross a 16-byte boundary.
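
The alignment claim above can be checked with a few lines of arithmetic.
This is only a sketch: the 224-entry count assumes vectors 0x20..0xff, and
the back-to-back "packed" layout is a hypothetical contrast case, not
something proposed in this thread.

```python
def crosses_boundary(start, length, block=16):
    """True if bytes [start, start+length) span two block-sized fetch blocks."""
    return start // block != (start + length - 1) // block

STUB_LEN = 2 + 5   # two-byte push imm8 plus five-byte jmp rel32
NUM_STUBS = 224    # assumption: one stub per vector in 0x20..0xff

# Stubs padded to 8 bytes: every stub starts at a multiple of 8.
aligned_splits = [i for i in range(NUM_STUBS)
                  if crosses_boundary(i * 8, STUB_LEN)]

# Hypothetical contrast: the same stubs packed at a 7-byte stride.
packed_splits = [i for i in range(NUM_STUBS)
                 if crosses_boundary(i * STUB_LEN, STUB_LEN)]

print("aligned stubs split across 16 bytes:", len(aligned_splits))
print("unpadded stubs split across 16 bytes:", len(packed_splits))
```

A 7-byte stub starting at offset 0 or 8 within a 16-byte block ends at
offset 6 or 14, so it always stays inside the block.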

Simple, clean, and slightly bigger in memory footprint, but probably
not noticeably more so in cache footprint, simply because there
usually aren't that many active interrupts anyway.

The people who do millions of networking interrupts per second and
have network cards that steer things to many different interrupts
already try to make sure that the steering goes to different CPUs -
otherwise there wouldn't be any *point* to steering things. So that
particular case of "lots of active interrupts" doesn't have a bigger
cache footprint *either*, since any particular CPU L1 I$ will still
only handle a few interrupts.

So you get "only" 4 interrupt cases per 32 bytes rather than 7. But is
that odd double jump and all this complexity really worth it?
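
To put rough numbers on the memory-versus-cache footprint argument, here
is a small sketch. It assumes FIRST_EXTERNAL_VECTOR is 0x20, 256 vectors
in total, 64-byte cache lines, and a 64-byte-aligned table; the three
"active" vectors are made up purely for illustration.

```python
import math

FIRST_EXTERNAL_VECTOR = 0x20   # assumption
NR_VECTORS = 256               # assumption
N_ENTRIES = NR_VECTORS - FIRST_EXTERNAL_VECTOR

def table_bytes(n_entries, per_block, block_bytes=32):
    """Total table size when per_block stubs share each 32-byte block."""
    return math.ceil(n_entries / per_block) * block_bytes

def lines_touched(active_vectors, per_block, block_bytes=32, line_bytes=64):
    """Distinct cache lines holding the stubs of the active vectors."""
    lines = set()
    for v in active_vectors:
        block = (v - FIRST_EXTERNAL_VECTOR) // per_block
        lines.add(block * block_bytes // line_bytes)
    return len(lines)

simple_size = table_bytes(N_ENTRIES, 4)   # 8-byte stubs, 4 per 32 bytes
packed_size = table_bytes(N_ENTRIES, 7)   # 7 stubs per 32 bytes

active = [0x31, 0x51, 0x71]               # hypothetical spread-out vectors
simple_lines = lines_touched(active, 4)
packed_lines = lines_touched(active, 7)

print(simple_size, packed_size, simple_lines, packed_lines)
```

Under these assumptions the simple table is larger in memory, but a
handful of spread-out vectors touches the same number of cache lines in
either layout.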

So I really suggest just doing something stupid and straightforward
(and completely untested) like this:

.macro make_vector
	pushq_cfi $(~vector+0x80)	/* encoded so the immediate fits a signed byte */
	jmp common_interrupt
	.align 8
	vector=vector+1			/* next stub pushes the next vector */
.endm

vector=FIRST_EXTERNAL_VECTOR
.align 64
ENTRY(irq_entries_start)
.rept 256 /* this number does not need to be exact, just big enough */
	make_vector
.endr

and just be done with it.
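
The `~vector+0x80` expression is what lets the push use a one-byte
immediate. The sketch below checks that range for vectors 0x20..0xff
(assuming FIRST_EXTERNAL_VECTOR is 0x20). The decode() helper is simply
the arithmetic inverse, included for illustration; how common_interrupt
actually recovers the vector is not shown in this message.

```python
FIRST_EXTERNAL_VECTOR = 0x20   # assumption

def encode(vector):
    """Immediate pushed by the stub: ~vector + 0x80."""
    return ~vector + 0x80

def decode(pushed):
    """Arithmetic inverse of encode(): drop the 0x80 bias, then complement."""
    return ~(pushed - 0x80)

def fits_imm8(value):
    """True if value is representable as a sign-extended 8-bit immediate."""
    return -128 <= value <= 127

vectors = range(FIRST_EXTERNAL_VECTOR, 0x100)
all_fit = all(fits_imm8(encode(v)) for v in vectors)
round_trip = all(decode(encode(v)) == v for v in vectors)

print(all_fit, round_trip, fits_imm8(encode(0x100)))
```

Note that a vector value of 0x100 or above would no longer encode into a
signed byte, so a repeat count that runs past vector 0xff would affect the
two-byte push assumption for those extra entries.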

(Of course, you have to change the code that knows about the "7
entries in 32 bytes" patterns too, but that's just going to be much
simpler now).
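
With a fixed 8-byte stride, locating the stub for a given vector becomes a
single multiply-and-add. A minimal sketch of that computation follows;
stub_address() and the base address 0x1000 are hypothetical, and
FIRST_EXTERNAL_VECTOR is assumed to be 0x20.

```python
FIRST_EXTERNAL_VECTOR = 0x20   # assumption
STUB_STRIDE = 8                # 7-byte stub padded by .align 8

def stub_address(irq_entries_start, vector):
    """Address of the entry stub for vector in the fixed-stride table."""
    if not FIRST_EXTERNAL_VECTOR <= vector <= 0xff:
        raise ValueError("vector outside the external range")
    return irq_entries_start + STUB_STRIDE * (vector - FIRST_EXTERNAL_VECTOR)

base = 0x1000                  # hypothetical 64-byte-aligned table address
for v in (0x20, 0x21, 0xff):
    print(hex(v), hex(stub_address(base, v)))
```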

Linus