Re: HZ=1000 & kernel profiling?

David S. Miller (davem@jenolan.rutgers.edu)
Sun, 19 Jan 1997 15:01:16 -0500


Date: Sun, 19 Jan 1997 20:47:26 +0100 (MET)
From: Ingo Molnar <mingo@pc5829.hil.siemens.at>

some hardware sucks ... for example, on my pentium system, just getting
to the first instruction of the IRQ handler costs ... 8
microseconds :(( I guess it's due to the legacy PIC chip still sitting
on the ISA bus ...

really, 8 microseconds, from the point where CPU execution stops, to the
point where the interrupt vector shows. It's 800 wasted cycles. PC
hardware sucks.
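
The arithmetic behind the 800-cycle figure can be sketched as below. This
is only an illustration: the 100 MHz clock is an assumption inferred from
"8 microseconds = 800 cycles" (the clock speed is not stated above), and
both function names are made up for the example. The second helper puts
the same 8 microseconds in terms of the HZ=1000 timer rate from the
subject line.

```c
#include <stdint.h>

/* Convert a latency in nanoseconds to CPU cycles.
 * One MHz is one cycle per microsecond, so
 * cycles = latency_ns * clock_mhz / 1000. */
uint64_t latency_to_cycles(uint64_t latency_ns, uint64_t clock_mhz)
{
    return latency_ns * clock_mhz / 1000;
}

/* CPU time lost to interrupt entry alone, in parts per million,
 * for a timer firing hz times per second:
 * ppm = hz * latency_ns / 1000. */
uint64_t irq_entry_overhead_ppm(uint64_t hz, uint64_t latency_ns)
{
    return hz * latency_ns / 1000;
}
```

With the assumed 100 MHz clock, latency_to_cycles(8000, 100) gives 800
cycles, and irq_entry_overhead_ppm(1000, 8000) gives 8000 ppm, i.e. about
0.8% of the CPU spent just reaching the handler at HZ=1000.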

So what, the Alpha eats 700 cycles for each ll/sc atomic sequence
because braindead DEC engineers thought it was a cool idea to go
directly to physical memory and bypass the caches for these.
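
For readers unfamiliar with the term, an ll/sc atomic sequence is a
load-locked / store-conditional retry loop. The sketch below is not Alpha
code; it uses C11 <stdatomic.h> as a stand-in, on the assumption that a
compiler targeting an ll/sc machine lowers the weak compare-exchange to
such a loop. The function name is hypothetical.

```c
#include <stdatomic.h>

/* Atomically add delta to *v and return the new value.
 * The weak compare-exchange may fail spuriously (as a
 * store-conditional can), so it is retried in a loop. On
 * failure, 'old' is refreshed with the current value of *v. */
int atomic_add_return_sketch(atomic_int *v, int delta)
{
    int old = atomic_load(v);

    while (!atomic_compare_exchange_weak(v, &old, old + delta))
        ;   /* retry with the refreshed 'old' */

    return old + delta;
}
```

Each pass through the loop is one load/modify/conditional-store attempt,
which is the sequence whose cost is being complained about here.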

i will measure how expensive the SMP IPI interrupts are, from the hardware
point of view. Maybe it makes sense to bombard one CPU with cross-CPU
interrupts, generating profiling irqs. They should be much cheaper,
theoretically, and if you control the bombardment, they can be rather
random and irrational compared to the timer IRQ on the first CPU.

curious what the typical hardware irq latency numbers look like on the
Sparc :)

On the SuperSparc at least, a trap can sometimes cost a lot. This is
because, before the first instruction of the trap handler can execute,
the entire store buffer must be flushed to the L1 cache or to the L2
cache (the latter when an L2 cache is present, in which case the L1
caches are write-through; otherwise they are copy-back). This can
eat numerous cycles, especially when a write misses the cache, since you
must wait to arbitrate for the system bus for each missed line
that the store buffer contents need to go to.

On the other hand, the HyperSparc services traps extremely fast: when
the exception condition is detected, the chip prefetches the
instructions for the trap handler entry point _in parallel_ with the
pipeline flush, so by the time the pipe is cleaned the chip is executing
the trap entry code in full superscalar fashion.

PIO accesses are expensive on the Sparc as well, more so on the
higher end SMP systems (you could be poking at I/O space 3 system
boards away from the one holding the cpu that executes the load/store).

Also, Intel chips eat close to 40 clocks for cli/sti instructions.
That is one of the prime motivations behind the software-based cli/sti
mechanism I am working on. A nice side effect is that cli/sti will be
multiprocessor safe, and thus all interrupts can be serviced with zero
locking on SMP. I'm eager to see the performance gains; I think they
will be high, especially for the SMP case, since no other SMP Unix I
know of can pull this off or even tries to.
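
As a rough illustration of what a software-based cli/sti can look like,
here is a minimal single-CPU sketch. It is not the mechanism described
above, whose details are not given; it only shows the general idea of
replacing the hardware instruction with a flag, and it says nothing about
the multiprocessor side. Every name in it (soft_cli, soft_sti, irq_entry,
irq_handled, NR_IRQS) is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_IRQS 32                      /* assumed: one pending bit per IRQ */

static bool soft_irqs_disabled;         /* the software "interrupt flag" */
static uint32_t pending_irqs;           /* IRQs that arrived while disabled */
static unsigned int handled[NR_IRQS];   /* stand-in for real handler work */

static void handle_irq(unsigned int irq)
{
    handled[irq]++;
}

/* Test helper: how many times has this IRQ actually been serviced? */
unsigned int irq_handled(unsigned int irq)
{
    return irq < NR_IRQS ? handled[irq] : 0;
}

/* "cli": a plain store instead of the expensive instruction. */
void soft_cli(void)
{
    soft_irqs_disabled = true;
}

/* Low-level entry point, called when the hardware delivers an IRQ.
 * If interrupts are soft-disabled, only record the IRQ as pending. */
void irq_entry(unsigned int irq)
{
    if (irq >= NR_IRQS)
        return;
    if (soft_irqs_disabled) {
        pending_irqs |= (uint32_t)1 << irq;
        return;
    }
    handle_irq(irq);
}

/* "sti": clear the flag, then replay whatever was deferred. */
void soft_sti(void)
{
    soft_irqs_disabled = false;
    for (unsigned int irq = 0; irq < NR_IRQS && pending_irqs; irq++) {
        uint32_t bit = (uint32_t)1 << irq;
        if (pending_irqs & bit) {
            pending_irqs &= ~bit;
            handle_irq(irq);
        }
    }
}
```

In this sketch an IRQ that fires several times while disabled is replayed
only once, since a bitmask cannot count; a real implementation would also
have to make the flag and pending-mask updates safe against the interrupt
entry path itself, which this user-space model ignores.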

---------------------------------------------////
Yow! 11.26 MB/s remote host TCP bandwidth & ////
199 usec remote TCP latency over 100Mb/s ////
ethernet. Beat that! ////
-----------------------------------------////__________ o
David S. Miller, davem@caip.rutgers.edu /_____________/ / // /_/ ><