Re: [tip:perfcounters/core] perf_counter: x86: Fix call-chain support to use NMI-safe methods
From: Mathieu Desnoyers
Date: Mon Jun 15 2009 - 16:06:48 EST
* Ingo Molnar (mingo@xxxxxxx) wrote:
>
> * Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> > > If it's faster, this becomes a legit (albeit complex)
> > > micro-optimization in a _very_ hot codepath.
> >
> > I don't think it's all that hot. It's not like it's the return to
> > user mode.
>
> Well i guess it depends. For server apps it is true - syscalls are a
> lot more dominant, MMs are long-running so any startup cost gets
> amortized and pagefaults are avoided.
>
> For something like a kernel build we have 7 times as many pagefaults
> as syscalls:
>
> aldebaran:~/linux/linux> perf stat -- make -j32 >/dev/null
> [...]
> Performance counter stats for 'make -j32':
>
> 1444281.076741  task-clock-msecs  #   14.429 CPUs
>         219991  context-switches  #    0.000 M/sec
>          18335  CPU-migrations    #    0.000 M/sec
>       38465628  page-faults       #    0.027 M/sec
>  4374762924204  cycles            # 3029.025 M/sec
>  2645979309823  instructions      #    0.605 IPC
>    42398991227  cache-references  #   29.356 M/sec
>     4371920878  cache-misses      #    3.027 M/sec
>
> 100.097787566 seconds time elapsed.
>
> So we have 38465628 page-faults, or one every 68788 instructions,
> one every 113731 cycles.
>
> 10 cycles saved in the page fault costs means 0.01% performance win
> - or about 10 milliseconds shaven off the kernel build time.
>
> 100 cycles saved (which is impossible really in the entry/exit path)
> would mean 0.1% win.
>
> 5653639 syscalls (according to strace -c) - which is a factor of 6.8
> lower. Same goes for shell scripts or most of the clicking we do on
> a GUI.
>
> It's not a big factor for sure.
>
> Btw., the biggest pagefault cost is in the fault handling itself
> (the page clearing):
>
> 4.14% [k] do_page_fault
> 1.20% [k] sys_write
> 1.10% [k] sys_open
> 0.63% [k] sys_exit_group
> 0.48% [k] smp_apic_timer_interrupt
> 0.37% [k] sys_read
> 0.37% [k] sys_execve
> 0.20% [k] sys_mmap
> 0.18% [k] sys_close
> 0.14% [k] sys_munmap
> 0.13% [k] sys_poll
> 0.09% [k] sys_newstat
> 0.07% [k] sys_clone
> 0.06% [k] sys_newfstat
>
> it totals to 4.14% of the total cost (user-space cycles included) of
> a kernel build, on a Nehalem box.
>
> Ingo
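
As a sanity check on the figures quoted above, here is a small sketch that
recomputes the per-fault numbers from the perf stat output. The counter values
are copied from the quoted output; the variable and function names are only
illustrative, and the "relative win" assumes the saved cycles translate
one-to-one into total cycles, which is a simplification.

```python
# Counter values copied from the quoted `perf stat -- make -j32` output.
page_faults = 38_465_628
cycles = 4_374_762_924_204
instructions = 2_645_979_309_823
syscalls = 5_653_639  # the quoted `strace -c` figure

# Average distance between two page faults, truncated to whole units.
instructions_per_fault = instructions // page_faults
cycles_per_fault = cycles // page_faults

# How many page faults occur for each syscall.
faults_per_syscall = page_faults / syscalls


def relative_win(cycles_saved_per_fault):
    """Fraction of all cycles saved if every fault gets this much cheaper."""
    return cycles_saved_per_fault * page_faults / cycles


if __name__ == "__main__":
    print("instructions per fault:", instructions_per_fault)
    print("cycles per fault:      ", cycles_per_fault)
    print(f"faults per syscall:     {faults_per_syscall:.1f}")
    print(f"10 cycles saved/fault:  {relative_win(10):.3%}")
    print(f"100 cycles saved/fault: {relative_win(100):.3%}")
```

The integer results match the 68788 instructions and 113731 cycles per fault
quoted above, and the 10-cycle case comes out just under the quoted 0.01%.
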
In the category "crazy ideas one should never express out loud", I could add the
following. We could choose to save/restore the cr2 register on the local stack
at every interrupt entry/exit, and therefore allow the page fault handler to
execute with interrupts enabled.

I have not benchmarked the interrupt-disabling overhead that the page fault
handler incurs by being entered through an interrupt gate rather than a trap
gate, but cli/sti instructions are known to take quite a few cycles on some
architectures, e.g. 131 cycles for the pair on P4, 23 cycles on AMD Athlon X2
64, and 43 cycles on Intel Core2.

I am tempted to think that spending, say, ~10 cycles on the interrupt path is
worth it if we save a few tens of cycles on the page fault handler fast path.

But again, this calls for benchmarks.
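
To make the trade-off concrete, here is a rough break-even sketch. It assumes,
optimistically, that the whole cli/sti pair cost cited in this message is
recovered on each page fault, and that saving/restoring cr2 costs the guessed
~10 cycles per interrupt. Neither number is a measurement of the proposed
change, and the function names are hypothetical.

```python
# cli/sti pair costs in cycles, as cited in this message.
CLI_STI_PAIR_CYCLES = {"P4": 131, "AMD Athlon X2 64": 23, "Intel Core2": 43}

# Assumption: rough guess for saving/restoring cr2 on every interrupt.
CR2_SAVE_RESTORE_CYCLES = 10


def break_even_interrupts_per_fault(saved_per_fault,
                                    cost_per_interrupt=CR2_SAVE_RESTORE_CYCLES):
    """Interrupts per page fault at which the added interrupt cost
    exactly cancels the cycles saved on the page fault path."""
    return saved_per_fault / cost_per_interrupt


def net_cycles_saved(page_faults, interrupts, saved_per_fault,
                     cost_per_interrupt=CR2_SAVE_RESTORE_CYCLES):
    """Positive when the change is a net win for the given workload."""
    return page_faults * saved_per_fault - interrupts * cost_per_interrupt


if __name__ == "__main__":
    for cpu, pair_cost in CLI_STI_PAIR_CYCLES.items():
        ratio = break_even_interrupts_per_fault(pair_cost)
        print(f"{cpu}: net win while interrupts per page fault < {ratio:.1f}")
```

Under these assumptions the idea pays off as long as a workload takes only a
few interrupts per page fault. Workloads dominated by interrupts would lose,
which is one more reason to benchmark.
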
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68