I have these questions:
* There is a rw semaphore that is locked for read for nearly all operations and locked for write only rarely. However locking for read causes cache line pingpong on SMP systems. Do you have an idea how to make it better?
It could be improved by making a semaphore for each CPU and locking for read only the CPU's semaphore and for write all semaphores. Or is there a better method?
If you believe you need a semaphore for protecting a mostly read structure, then RCU is certainly a good candidate. (ie no locked operation at all)
The problem with a per_cpu biglock is that you may consume a lot of RAM for big NR_CPUS. Count 32 KB per 'biglock' if NR_CPUS=1024
* This leads to another observation --- on i386 locking a semaphore is 2 instructions, on x86_64 it is a call to two nested functions. Has it some reason or was it just implementator's laziness? Given the fact that locked instruction takes 16 ticks on Opteron (and can overlap about 2 ticks with other instructions), it would make sense to have optimized semaphores too.
Hum, please dont use *lazy*, this could make Andi unhappy :)
What are you calling semaphore exactly ?
Did you read Documentation/mutex-design.txt ?