Then you can't use __local_xxx, and so many architectures will use
atomic instructions (the ones that don't are the ones with a tripled
cacheline footprint for this structure).
They are wrong then. Atomic instructions are the wrong implementation
and they would be better off with asm-generic.
If anything they should use per_cpu counters for interrupts and use seqlocks.
Or just turn off the interrupts for a short time
in the low level code.
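
To illustrate the seqlock idea, here is a minimal userspace sketch in
C11. It is not kernel code: the names (seq_stats, seq_stats_add,
seq_stats_read) are hypothetical, and it assumes a single writer per
structure, as with a per-CPU counter. The writer needs no locked
read-modify-write instruction; readers simply retry if they raced with
an update.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical example type: two counters that must be read consistently. */
struct seq_stats {
    atomic_uint seq;           /* odd while an update is in progress */
    _Atomic uint64_t packets;  /* plain loads/stores only, no locked RMW */
    _Atomic uint64_t bytes;
};

struct seq_snapshot {
    uint64_t packets;
    uint64_t bytes;
};

/* Writer side. Assumes exactly one writer per structure (e.g. per CPU). */
static void seq_stats_add(struct seq_stats *s, uint64_t packets, uint64_t bytes)
{
    unsigned seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

    /* Mark the update as in progress (sequence becomes odd). */
    atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    uint64_t p = atomic_load_explicit(&s->packets, memory_order_relaxed);
    uint64_t b = atomic_load_explicit(&s->bytes, memory_order_relaxed);
    atomic_store_explicit(&s->packets, p + packets, memory_order_relaxed);
    atomic_store_explicit(&s->bytes, b + bytes, memory_order_relaxed);

    /* Publish the update (sequence becomes even again). */
    atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader side. Retries until it sees an even, unchanged sequence. */
static struct seq_snapshot seq_stats_read(struct seq_stats *s)
{
    struct seq_snapshot snap;
    unsigned before, after;

    do {
        before = atomic_load_explicit(&s->seq, memory_order_acquire);
        snap.packets = atomic_load_explicit(&s->packets, memory_order_relaxed);
        snap.bytes = atomic_load_explicit(&s->bytes, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((before & 1) || before != after);

    return snap;
}
```

The cost on the update path is two extra stores to the sequence word
plus ordering fences, which on most architectures is cheaper than a
bus-locked instruction. Whether that holds on a given architecture is
something to measure, not assume.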
Sure, i386 and x86-64 are happy, but this would probably slow down
most other architectures.
I think it is better to fix the other architectures then - if they
are really using a full-scale bus lock for this they're just wrong.
I don't think it is a good idea to do a large change in generic
code just for dumb low level code.