Re: [PATCH v7 1/4] spinlock: A new lockref structure for locklessupdate of refcount

From: Linus Torvalds
Date: Thu Aug 29 2013 - 22:31:32 EST


On Thu, Aug 29, 2013 at 7:06 PM, Michael Neuling <mikey@xxxxxxxxxxx> wrote:
>
> Running on a POWER7 here with 32 threads (8 cores x 4 threads) I'm
> getting some good improvements:

That's *much* better than I get. But I literally just have a single
socket with two cores (and HT, so four threads) in my test machine, so
I really have a hard time getting any real contention. And the main
advantage of the patch should be when you actually have CPU's spinning
on that dentry d_lock.

Also, on x86, there are no advantages to cmpxchg over a spinlock -
they are both exactly one equally serializing instruction. If
anything, cmpxchg is worse due to having a cache read before the
write, and a few cycles slower anyway. So I actually expect the x86
code to slow down a tiny bit for the single-threaded case, although
that should be hopefully unmeasurable.

On POWER, you may have much less serialization for the cmpxchg. That
may sadly be something we'll need to fix - the serialization between
getting a lockref and checking sequence counts etc may need some extra
work.

So it may be that you are seeing unrealistically good numbers, and
that we will need to add a memory barrier or two. On x86, due to the
locked instruction semantics, that just isn't an issue.

> The numbers move around about 10% from run to run.

Please note that the whole "dentry hash chains may be better" for one
run vs another, and that's something that will _persist_ between
subsequent runs, so you may see "only 10% variability", but there may
be a bigger picture variability that you're not noticing because you
had to reboot in between.

To be really comparable, you should really run the stupid benchmark
after fairly equal boot up sequences. If the machine had been up for
several days for one set of numbers, and freshly rebooted for the
other, it can be a very unfair comparison.

(I long ago had a nice "L1 dentry cache" patch that helped with the
fact that the dentry chains *can* get long especially if you have tons
of memory, and that helped with this kind of variability a lot - and
improved performance too. It was slightly racy, though, which is why
it never got merged).

> powerpc patch below. I'm using arch_spin_is_locked() to implement
> arch_spin_value_unlocked().

Your "slock" is of type "volatile unsigned int slock", so it may well
cause those temporaries to be written to memory.

It probably doesn't matter, but you may want to check that the result
of "make lib/lockref.s" looks ok.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/