[PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock

From: Ingo Molnar
Date: Fri Apr 10 2015 - 05:01:07 EST



* Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx> wrote:

> > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about
> > performance *at*all*. We will have worse performance problems than
> > doing some RCU read-locking inside the loop.
> >
> > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care
> > about locking, since at worst we just access stale memory for one
> > iteration.
>
> But if we are running on a hypervisor, mightn't our VCPU be
> preempted just before accessing ->on_cpu, the task exit and its
> structures be freed and unmapped? Or is the task structure in
> memory that is never unmapped? (If the latter, clearly not a
> problem.)

kmalloc()able kernel memory is never unmapped in that fashion [*].
Even hotplug memory is based on limiting what gets allocated in that
area and never putting critical kernel data structures there.

Personally I'd be more comfortable with having a special primitive for
this that is DEBUG_PAGEALLOC aware (Linus's first suggestion), so that
we don't use different RCU primitives in the rare case someone tests
CONFIG_DEBUG_PAGEALLOC=y ...

We even have such a primitive: __copy_from_user_inatomic(). It
compiles to a single instruction for integer types on x86. I've
attached a patch that implements it for the regular mutexes (xadd can
be done too), and it all compiles to a rather sweet, compact routine:

0000000000000030 <mutex_spin_on_owner.isra.4>:
  30:   48 3b 37                cmp    (%rdi),%rsi
  33:   48 8d 4e 28             lea    0x28(%rsi),%rcx
  37:   75 4e                   jne    87 <mutex_spin_on_owner.isra.4+0x57>
  39:   55                      push   %rbp
  3a:   45 31 c0                xor    %r8d,%r8d
  3d:   65 4c 8b 0c 25 00 00    mov    %gs:0x0,%r9
  44:   00 00
  46:   48 89 e5                mov    %rsp,%rbp
  49:   48 83 ec 10             sub    $0x10,%rsp
  4d:   eb 08                   jmp    57 <mutex_spin_on_owner.isra.4+0x27>
  4f:   90                      nop
  50:   f3 90                   pause
  52:   48 3b 37                cmp    (%rdi),%rsi
  55:   75 29                   jne    80 <mutex_spin_on_owner.isra.4+0x50>
  57:   44 89 c0                mov    %r8d,%eax
  5a:   90                      nop
  5b:   90                      nop
  5c:   90                      nop
  5d:   8b 11                   mov    (%rcx),%edx
  5f:   90                      nop
  60:   90                      nop
  61:   90                      nop
  62:   85 d2                   test   %edx,%edx
  64:   89 55 fc                mov    %edx,-0x4(%rbp)
  67:   74 0b                   je     74 <mutex_spin_on_owner.isra.4+0x44>
  69:   49 8b 81 10 c0 ff ff    mov    -0x3ff0(%r9),%rax
  70:   a8 08                   test   $0x8,%al
  72:   74 dc                   je     50 <mutex_spin_on_owner.isra.4+0x20>
  74:   31 c0                   xor    %eax,%eax
  76:   c9                      leaveq
  77:   c3                      retq
  78:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  7f:   00
  80:   b8 01 00 00 00          mov    $0x1,%eax
  85:   c9                      leaveq
  86:   c3                      retq
  87:   b8 01 00 00 00          mov    $0x1,%eax
  8c:   c3                      retq
  8d:   0f 1f 00                nopl   (%rax)

No RCU overhead, and this is the access to owner->on_cpu:

  5d:   8b 11                   mov    (%rcx),%edx

Totally untested and all that; I only built mutex.o.

What do you think? Am I missing anything?

Thanks,

Ingo

[*] with the exception of CONFIG_DEBUG_PAGEALLOC and other debug
mechanisms like CONFIG_KMEMCHECK (which is on the way out) that
are based on provoking page faults and fixing up page tables to
catch unexpected memory accesses.
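
For readers following the disassembly, here is a minimal user-space
sketch in C of the loop's control flow. It is a model, not the patch:
struct task_model, struct mutex_model, probe_read_int(),
need_resched_model and cpu_relax_model() are hypothetical stand-ins
for the kernel's task_struct, mutex, __copy_from_user_inatomic(),
need_resched() and the pause instruction. Stopping the spin on a
failed read is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: only the fields the loop touches. */
struct task_model  { int on_cpu; };
struct mutex_model { struct task_model *owner; };

/* Models need_resched(); set by the caller of the sketch. */
static bool need_resched_model;

/* Bookkeeping so the model can simulate the owner releasing the lock. */
static int spin_count;
static struct mutex_model *release_lock; /* lock to release, or NULL */
static int release_after;                /* release on this many spins */

/* Models one "pause" iteration; may simulate the owner dropping the lock. */
static void cpu_relax_model(void)
{
	spin_count++;
	if (release_lock && spin_count >= release_after)
		release_lock->owner = NULL;
}

/*
 * Models a fault-tolerant load. In the kernel the load would be covered
 * by an exception-table entry, so a fault yields an error code instead
 * of an oops. This model cannot fault and always returns 0.
 */
static int probe_read_int(int *dst, const int *src)
{
	memcpy(dst, src, sizeof(*dst));
	return 0;
}

/*
 * Returns true if the owner changed while we were spinning (worth
 * retrying the lock), false if spinning should stop.
 */
static bool spin_on_owner(struct mutex_model *lock, struct task_model *owner)
{
	assert(lock && owner);

	while (lock->owner == owner) {
		int on_cpu;

		/* Assumption: treat a failed read as "stop spinning". */
		if (probe_read_int(&on_cpu, &owner->on_cpu) != 0)
			return false;

		/* Owner is not running, or we should yield the CPU. */
		if (!on_cpu || need_resched_model)
			return false;

		cpu_relax_model();
	}
	return true;
}
```

The point of the design is that the owner pointer is only compared,
and the single dereference of it goes through the fault-tolerant
read, so no RCU read-side section is needed around the loop.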

=================================>