Tejun, could you please also add the patch below to your lineup?
It is an optimization and a cleanup, and adds the following new generic percpu methods:
percpu_read()
percpu_write()
percpu_add()
percpu_sub()
percpu_or()
percpu_xor()
and implements support for them on x86. (Other architectures will fall back to a default implementation.)
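To make the semantics of the new methods concrete, here is a minimal userspace model of what a default (non-optimized) implementation could look like. This is a sketch, not the code from the patch: NR_CPUS_MODEL, current_cpu, DEFINE_PER_CPU_MODEL() and this_cpu_slot() are hypothetical stand-ins, the per-CPU variable is modeled as a plain array with one slot per CPU, and preemption handling is left out entirely.

```c
#include <assert.h>

/* Hypothetical CPU count for this userspace model. */
#define NR_CPUS_MODEL 4

/* Stand-in for the id of the CPU we are "running on" (assumption:
 * the kernel would obtain this itself, with preemption disabled). */
static int current_cpu;

/* Model a per-CPU variable as one slot per CPU. */
#define DEFINE_PER_CPU_MODEL(type, name) \
	static type per_cpu_model_##name[NR_CPUS_MODEL]

/* Fallback path: index this CPU's slot, then apply the operation. */
#define this_cpu_slot(name)	(per_cpu_model_##name[current_cpu])

#define percpu_read(name)	(this_cpu_slot(name))
#define percpu_write(name, val)	do { this_cpu_slot(name) = (val); } while (0)
#define percpu_add(name, val)	do { this_cpu_slot(name) += (val); } while (0)
#define percpu_sub(name, val)	do { this_cpu_slot(name) -= (val); } while (0)
#define percpu_or(name, val)	do { this_cpu_slot(name) |= (val); } while (0)
#define percpu_xor(name, val)	do { this_cpu_slot(name) ^= (val); } while (0)

/* Example per-CPU variable used to exercise the macros. */
DEFINE_PER_CPU_MODEL(unsigned long, counter);
```

Each operation only ever touches the slot of the current CPU, which is the property the x86 version can exploit to emit a single segment-relative instruction.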
The advantage is that for example to read a local percpu variable, instead of this sequence:
return __get_cpu_var(var);
ffffffff8102ca2b: 48 8b 14 fd 80 09 74 mov -0x7e8bf680(,%rdi,8),%rdx
ffffffff8102ca32: 81 ffffffff8102ca33: 48 c7 c0 d8 59 00 00 mov $0x59d8,%rax
ffffffff8102ca3a: 48 8b 04 10 mov (%rax,%rdx,1),%rax
we can get a single instruction by using the optimized variants:
return percpu_read(var);
ffffffff8102ca3f: 65 48 8b 05 91 8f fd mov %gs:0x7efd8f91(%rip),%rax
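The difference between the two sequences can be modeled in plain C. The sketch below is a userspace illustration based on one reading of the disassembly above, not kernel code: the generic path looks up a per-CPU base in a table indexed by CPU number, loads the variable's offset, and then dereferences the sum, while the optimized path assumes the base is already held in a per-CPU register (%gs on x86-64) so only the final load remains. All names here (percpu_area_model, percpu_base_model, VAR_OFFSET_MODEL, gs_base_model) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_CPUS_MODEL2   2
#define AREA_SIZE_MODEL  64
#define VAR_OFFSET_MODEL 8	/* plays the role of the $0x59d8 constant */

/* One private data area per CPU. */
static unsigned char percpu_area_model[NR_CPUS_MODEL2][AREA_SIZE_MODEL];

/* Table of per-CPU bases, indexed by CPU number. */
static unsigned char *percpu_base_model[NR_CPUS_MODEL2] = {
	percpu_area_model[0],
	percpu_area_model[1],
};

/* Generic path: three dependent steps, like the three mov instructions. */
static unsigned long read_generic_model(int cpu)
{
	unsigned char *base = percpu_base_model[cpu];	/* table lookup */
	size_t off = VAR_OFFSET_MODEL;			/* variable offset */
	unsigned long v;

	memcpy(&v, base + off, sizeof(v));		/* final load */
	return v;
}

/* Model of the segment base: set once per CPU, then used implicitly. */
static unsigned char *gs_base_model;

/* Optimized path: base is already known, only the load remains. */
static unsigned long read_segment_model(void)
{
	unsigned long v;

	memcpy(&v, gs_base_model + VAR_OFFSET_MODEL, sizeof(v));
	return v;
}
```

In this model both functions return the same value whenever gs_base_model points at the area of the CPU passed to read_generic_model(); the optimized form simply skips the table lookup and the explicit address arithmetic.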
I also cleaned up the x86-specific APIs and made the x86 code use these new generic percpu primitives.
It looks quite hard to convince the compiler to generate the optimized single-instruction sequence for us out of __get_cpu_var(var) - or can you perhaps see a way to do it?
The patch is against your latest zero-based percpu / pda unification tree. Untested.