Good question. Btw. - faster by what order of magnitude?
pushf + popf takes at least ~18 cycles on K8; on P4 it is much more
(hundreds of cycles) because they synchronize the pipeline there.
A CPU-local add would be a few cycles at best and doesn't have
any impact on the pipeline.
local_irq_save/restore seems to be fine for kernel/profile.c
Reason 1:
cpu_local_* uses __get_cpu_var, which conflicts with struct statistic
being embedded into struct xyz that is allocated whenever the client
needs it.
I could try to use local_t in conjunction with local_add etc.
(as seen in include/linux/dmaengine.h in 2.6.17-mm6).
Does this also yield a performance gain worth consideration?
Yes, but you would need preempt_disable() then. For non-preemptible
kernels (the vast majority) that would already be a big win.
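
To make the idea concrete, here is a rough userspace model of the per-CPU
counter pattern, not kernel code. Everything in it (struct model_stat,
model_local_add, model_stat_sum, MODEL_NR_CPUS) is a hypothetical name made
up for illustration; in the kernel the slots would be local_t, the add would
be local_add(), and the caller would hold preempt_disable() around it as
noted above.

```c
/* Userspace model only; none of these names exist in the kernel. */
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count for the model */

/* Hypothetical stand-in for a statistic embedded in a client's struct. */
struct model_stat {
	long count[MODEL_NR_CPUS];	/* one slot per CPU, plays the role of local_t */
};

/*
 * Plain, non-atomic add to the slot owned by 'cpu'. No lock prefix and no
 * interrupt disabling: only the owning CPU ever writes its slot. In the
 * kernel the caller would need preemption disabled so that it cannot
 * migrate between picking the slot and doing the add.
 */
static void model_local_add(struct model_stat *s, int cpu, long n)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	s->count[cpu] += n;
}

/* A reader sums all slots; the total may be slightly stale, which is
 * usually acceptable for statistics. */
static long model_stat_sum(const struct model_stat *s)
{
	long total = 0;
	size_t i;

	for (i = 0; i < MODEL_NR_CPUS; i++)
		total += s->count[i];
	return total;
}
```

The update path touches only the local slot, which is why it can be cheap;
the cost moves to the (rare) reader that has to walk all slots.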
So, removing local_irq_save/restore would require statistics to be
switched on and their buffers being available all the time. That is,
buffers holding counters etc. can't be allocated at run time - what
if allocation fails? (Should I leave this issue to clients?).
Can't you use RCU for this?
Reason 4:
The alleged overhead of local_irq_save/restore (as compared
to atomic operations)
local_* doesn't need to be atomic. It isn't on x86 at least.
On some other architectures it can be, but I think it's just a SMOP
to fix them.
-Andi