On Tue, Aug 23, 2016 at 03:36:17PM -0400, Waiman Long wrote:
> I think this is the right way to go. There isn't any big change in the
> slowpath, so the contended performance should be the same. The fastpath,
> however, will get a bit slower as a single atomic op plus a jump instruction
> (a single cacheline load) is replaced by a read-and-test and cmpxchg
> (potentially 2 cacheline loads), which will be somewhat slower than the
> optimized assembly code.

Yeah, I'll try and run some workloads tomorrow if you and Jason don't
beat me to it ;-)
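
FWIW, the difference in fastpath shape is roughly the below (user-space
sketch with GCC builtins; the names, the count encoding and the 3-bit
flag mask are made up for illustration, it is not the actual patch):

#include <stdbool.h>

struct old_mutex { long count; };          /* 1 = unlocked, <= 0 = locked/contended */
struct new_mutex { unsigned long owner; }; /* 0 = unlocked, else owning task | flags */

/* Old style: one atomic op, then a jump on its result. */
static bool old_fastpath(struct old_mutex *lock)
{
	return __atomic_fetch_sub(&lock->count, 1, __ATOMIC_ACQUIRE) == 1;
}

/* New style: a read-and-test followed by a cmpxchg; potentially two
 * loads of the same cacheline. */
static bool new_fastpath(struct new_mutex *lock, unsigned long curr)
{
	unsigned long old = __atomic_load_n(&lock->owner, __ATOMIC_RELAXED);

	if (old & ~0x07UL)		/* owner bits set -> take the slowpath */
		return false;

	/* Preserve whatever flag bits were set, only install the owner. */
	return __atomic_compare_exchange_n(&lock->owner, &old, curr | old,
					   false, __ATOMIC_ACQUIRE,
					   __ATOMIC_RELAXED);
}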
> Alternatively, you can replace the
> __mutex_trylock() in mutex_lock() by just a blind cmpxchg to optimize the
> fastpath further.

Problem with that is that we need to preserve the flag bits, so we need
the initial load.
Or were you thinking of: cmpxchg(&lock->owner, 0UL, (unsigned long)current),
which only works on uncontended locks?
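
That is, something like this (again a user-space sketch with GCC
builtins, assuming the low bits of ->owner carry the waiter/handoff
flags; not the actual code):

#include <stdbool.h>

static bool blind_trylock(unsigned long *owner, unsigned long curr)
{
	unsigned long expected = 0UL;

	/* Succeeds only when the whole word is 0: lock free *and* no
	 * flag bits set. Any set flag bit makes it fail even though the
	 * lock itself is free, hence the load-first variant that folds
	 * the old flag bits into the new value. */
	return __atomic_compare_exchange_n(owner, &expected, curr, false,
					   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}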
> A cmpxchg will still be a tiny bit slower than other
> atomic ops, but it will be more acceptable, I think.

I don't think cmpxchg is much slower than say xadd or xchg; the typical
problem with cmpxchg is the looping part, but single instruction costs
should be similar.
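
For instance, the single-instruction forms are comparable; it is only
when cmpxchg has to loop that things get noticeably worse (sketch with
GCC builtins):

/* xadd-style: one locked instruction, returns the old value, never retries. */
static unsigned long add_xadd(unsigned long *word, unsigned long val)
{
	return __atomic_fetch_add(word, val, __ATOMIC_RELAXED);
}

/* Same result built from cmpxchg: a single successful cmpxchg costs
 * about the same, but each failure has to re-read and retry, and that
 * loop is what hurts under contention. */
static unsigned long add_cmpxchg(unsigned long *word, unsigned long val)
{
	unsigned long old = __atomic_load_n(word, __ATOMIC_RELAXED);

	while (!__atomic_compare_exchange_n(word, &old, old + val, false,
					    __ATOMIC_RELAXED, __ATOMIC_RELAXED))
		;	/* on failure, 'old' is updated with the current value */

	return old;
}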