Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes

From: Nick Piggin
Date: Thu Nov 13 2008 - 20:11:46 EST


Sorry to reply so late on this slightly offtopic rant...

On Wednesday 05 November 2008 21:26, Ingo Molnar wrote:
> * Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
> > On Tue, Nov 04, 2008 at 09:44:00PM +0100, Ingo Molnar wrote:

> > > It's only an issue on ancient CPUs that export all their LOCKed
> > > cycles to the bus. Pentium and older or so. The PPro got it right
> > > already.
> >
> > ??? LOCK slowness is not because of the bus. And I know you know
> > that Ingo, so I don't know why you wrote that bogosity above.
>
> .. of course the historic LOCK slowness was all due to the system bus:
> very old CPUs exported a LOCK signal to the system bus for every
> LOCK-prefix access (implicit and explicit) and that made it _really_
> expensive. (hundreds of cycles)
>
> ... on reasonably modern CPUs the LOCK-ed access has been abstracted
> away to within the CPU, and the cost of LOCK-ed access is rather low
> (think 10-20 cycles - of course only if there's no cache miss cost)
> (That's obviously the case with the GDT, which is both per CPU and well
> cached.)

A locked instruction is, AFAIR, about 50 cycles on Core2. I think it is
a bit lower on K8. On Nehalem, which has optimisations for these,
I have heard it is still about 20-25 cycles. Although I don't have
one, so I don't actually know.

These (on my Core2) don't seem to pipeline at all with other
instructions either. So on my Core2, a locked instruction is worth
maybe 150-200 regular pipelined, superscalar instructions.
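
For anyone who wants to sanity-check these numbers on their own box, here
is a rough userspace sketch (not kernel code). It assumes GCC or Clang on
x86, where the __atomic_fetch_add builtin compiles to a LOCK-prefixed
read-modify-write. The helper names are made up for illustration, and the
result is only a ballpark, since loop overhead is included in both loops:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() */
#endif
#include <stdint.h>
#include <time.h>

/* All names below are hypothetical helpers, not from any kernel tree. */

static unsigned long locked_counter;

/* Monotonic timestamp in nanoseconds. */
uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* n plain increments; volatile stops the compiler deleting the loop. */
unsigned long plain_incs(unsigned long n)
{
	volatile unsigned long c = 0;
	unsigned long i;

	for (i = 0; i < n; i++)
		c++;
	return c;
}

/* n atomic increments; assumed to become a LOCK-prefixed RMW on x86. */
unsigned long locked_incs(unsigned long n)
{
	unsigned long i;

	locked_counter = 0;
	for (i = 0; i < n; i++)
		__atomic_fetch_add(&locked_counter, 1, __ATOMIC_SEQ_CST);
	return locked_counter;
}

/* Average wall-clock nanoseconds per iteration of fn(n). */
double ns_per_op(unsigned long (*fn)(unsigned long), unsigned long n)
{
	uint64_t t0 = now_ns();

	fn(n);
	return (double)(now_ns() - t0) / (double)n;
}
```

Comparing ns_per_op(plain_incs, n) with ns_per_op(locked_incs, n) for a
large n, and multiplying the difference by the clock rate in GHz, gives an
approximate cycle cost. That covers the uncontended, cache-hot case only.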

There is another big reason why lock instructions are expensive,
and that is because they have to prevent subsequent loads from
passing any previous stores that have not yet become visible. This
could in theory be speculated to some extent, but no matter what
happens, the program visible state can't be committed until the
stores are.

I heard from an Intel hardware engineer that Nehalem has some
really fancy logic in it to make locked instructions "free", that
was nacked from earlier CPUs because it was too costly. So obviously
it is taking a fair whack of transistors or power for them to do it.
And even then it is far from free: it still seems to be one or two
orders of magnitude more expensive than a regular instruction.


> on _really_ modern CPUs LOCK can be as cheap as just a few cycles - so

Oh, maybe I'm mistaken about Nehalem then? How many is "just a few"?
If it is 25 non-pipelined cycles, then that's still 100 instructions
if it is a 4 issue machine.
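
The arithmetic is just stall cycles times issue width. A trivial sketch,
with a made-up helper name, assuming the locked instruction blocks issue
completely for its whole latency (that matches what I seem to see on
Core2, but it is an assumption in general):

```c
/*
 * Hypothetical helper: issue slots lost while a non-pipelined
 * instruction holds up the core for stall_cycles cycles on a machine
 * that could otherwise issue issue_width instructions per cycle.
 */
unsigned int lost_issue_slots(unsigned int stall_cycles,
			      unsigned int issue_width)
{
	return stall_cycles * issue_width;
}
```

lost_issue_slots(25, 4) is the 100 instructions above. 50 cycles at 3 to
4 instructions per cycle gives the 150-200 range for Core2.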


> low that we can stop bothering about it in the future. There's no
> fundamental physical reason why the LOCK prefix (implicit or explicit)
> should be expensive.

Even if they could make it free on the software side, it is obviously
expensive on the hardware side. Not bothering about it is a copout.
The atomic instruction speedups in Nehalem are cool, but what would
have been even cooler is if Intel had decided *not* to spend resources
making this cheaper because they found Linux has so few locked
instructions :)

Even if somehow the x86 ISA didn't have the implicit memory ordering
requirement in the lock instruction, I think it's obviously a special
case path that doesn't fit in with a load/store uarch (whether they
implement it in uops with an ll/sc-like thing or whatnot, it's going
to need special logic).
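
As a software analogy only, here is what an ll/sc-style atomic add looks
like: a load, then a conditional store that is retried if anyone else
touched the location in between. This is a sketch using the GCC/Clang
__atomic builtins with a made-up function name. It says nothing about how
any real CPU splits a locked instruction into uops, and on x86 the
compiler turns the conditional store into a lock cmpxchg anyway:

```c
#include <stdbool.h>

/* Hypothetical helper: atomic add as a load + conditional-store loop. */
unsigned long llsc_style_add(unsigned long *p, unsigned long delta)
{
	/* "load-linked": read the current value */
	unsigned long old = __atomic_load_n(p, __ATOMIC_RELAXED);

	/*
	 * "store-conditional": write old + delta only if *p still holds
	 * old. On failure, old is refreshed with the current value and
	 * the loop retries.
	 */
	while (!__atomic_compare_exchange_n(p, &old, old + delta, true,
					    __ATOMIC_SEQ_CST,
					    __ATOMIC_RELAXED))
		;

	return old;	/* value before the add */
}
```

Even in this form, the window between the load and the conditional store
needs something watching the cache line, which is the sort of special
logic I mean.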

IMO, we shouldn't stop bothering about the LOCK prefix in the foreseeable
future.
