[RFC]: Possible race condition in kernel futex code

From: Jaccon Bastiaansen
Date: Mon Oct 05 2015 - 11:26:05 EST


Hello all,

For a while now, we have been seeing our application crash occasionally
(due to an assert) in the middle of a pthread_mutex_lock() call.

To be more precise, our application is multithreaded (50 threads), and
all threads share a custom memory allocator. This memory allocator is
guarded by a recursive, priority-inheritance pthread mutex. The
application runs on an 8-core, 64-bit x86 system using the
3.12.42-rt58 kernel and glibc 2.13.
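
For reference, a mutex with these two properties is created roughly as
follows (a minimal sketch, not our actual allocator code):

#include <pthread.h>

static pthread_mutex_t alloc_lock;

static void alloc_lock_init(void)
{
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /* recursive: the owning thread may lock it again */
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        /* priority inheritance: handled by the PI futex code in the kernel */
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&alloc_lock, &attr);
        pthread_mutexattr_destroy(&attr);
}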

The assert occurs when the pthread_mutex_lock() function checks the
return value of the futex system call (see line 307 of the
pthread_mutex_lock() source code at
http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13).
When the assert fires, the kernel has returned -EDEADLK, which results
in a SIGABRT.
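
Paraphrasing (and abbreviating) the glibc 2.13 code around that line;
the exact source is at the URL above, so treat this as a sketch:

  int e = INTERNAL_SYSCALL (futex, __err, 4, &mutex->__data.__lock,
                            FUTEX_LOCK_PI, 1, 0);

  if (INTERNAL_SYSCALL_ERROR_P (e, __err)
      && (INTERNAL_SYSCALL_ERRNO (e, __err) == ESRCH
          || INTERNAL_SYSCALL_ERRNO (e, __err) == EDEADLK))
    {
      /* For an error-checking or recursive mutex the kernel must never
         report EDEADLK: that case is handled in userspace.  */
      assert (INTERNAL_SYSCALL_ERRNO (e, __err) != EDEADLK
              || (kind != PTHREAD_MUTEX_ERRORCHECK_NP
                  && kind != PTHREAD_MUTEX_RECURSIVE_NP));
      ...
    }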

The strange thing is that the kernel returns -EDEADLK because it
concludes that the calling thread is already the owner of the mutex.
But glibc has already checked for this before executing the futex
system call (see line 194 of
http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13).
The case where a recursive mutex is locked again by its owner is
handled entirely in glibc; no futex system call is required.
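
In other words, the userspace fast path in glibc looks roughly like
this (paraphrased from the source linked above; error handling
omitted):

  pid_t id = THREAD_GETMEM (THREAD_SELF, tid);

  if (mutex->__data.__owner == id)
    {
      /* Recursive re-lock by the owner: handled completely in
         userspace, no futex system call.  */
      ++mutex->__data.__count;
      return 0;
    }

  /* Not the owner: try to take the lock with an atomic 32-bit
     compare-exchange; only if that fails (lock already held) does
     glibc enter the kernel via FUTEX_LOCK_PI.  */
  oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
                                                id, 0);
  if (oldval != 0)
    /* contended case: futex system call */
    ...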

The kernel futex code determines that the calling thread is the owner
of the mutex by reading the value of the __data.__lock field of the
pthread_mutex_t. This field is read by the get_futex_value_locked()
function (see kernel/futex.c). The __data.__lock field of a
pthread_mutex_t is written with an atomic 32-bit compare-exchange when
the mutex is locked (see line 288 of
http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13).
The kernel's read of this value should therefore be atomic as well.
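
For reference, the C source of get_futex_value_locked() (kernel/futex.c,
as found in 3.12) is simply:

static int get_futex_value_locked(u32 *dest, u32 __user *from)
{
        int ret;

        pagefault_disable();
        ret = __copy_from_user_inatomic(dest, from, sizeof(u32));
        pagefault_enable();

        return ret ? -EFAULT : 0;
}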

Much to our surprise, the disassembly of the get_futex_value_locked()
function is:

ffffffff81098890 <get_futex_value_locked>:
ffffffff81098890: 55 push %rbp
ffffffff81098891: 48 89 e5 mov %rsp,%rbp
ffffffff81098894: 48 83 ec 10 sub $0x10,%rsp
ffffffff81098898: 48 89 5d f0 mov %rbx,-0x10(%rbp)
ffffffff8109889c: 48 89 fb mov %rdi,%rbx
ffffffff8109889f: 4c 89 65 f8 mov %r12,-0x8(%rbp)
ffffffff810988a3: 49 89 f4 mov %rsi,%r12
ffffffff810988a6: e8 35 e9 06 00 callq ffffffff811071e0 <pagefault_disable>
ffffffff810988ab: 48 89 df mov %rbx,%rdi
ffffffff810988ae: 4c 89 e6 mov %r12,%rsi
ffffffff810988b1: ba 04 00 00 00 mov $0x4,%edx
ffffffff810988b6: e8 85 f0 2b 00 callq ffffffff81357940 <copy_user_generic_unrolled>
ffffffff810988bb: 89 c3 mov %eax,%ebx
ffffffff810988bd: e8 fe e8 06 00 callq ffffffff811071c0 <pagefault_enable>
ffffffff810988c2: 83 fb 01 cmp $0x1,%ebx
ffffffff810988c5: 4c 8b 65 f8 mov -0x8(%rbp),%r12
ffffffff810988c9: 19 c0 sbb %eax,%eax
ffffffff810988cb: 48 8b 5d f0 mov -0x10(%rbp),%rbx
ffffffff810988cf: c9 leaveq
ffffffff810988d0: f7 d0 not %eax
ffffffff810988d2: 83 e0 f2 and $0xfffffff2,%eax
ffffffff810988d5: c3 retq
ffffffff810988d6: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:0x0(%rax,%rax,1)

So the copy_user_generic_unrolled() function is called to read the
32-bit __data.__lock field. The disassembly of the
copy_user_generic_unrolled() function:


ffffffff81357940 <copy_user_generic_unrolled>:
ffffffff81357940: 83 fa 08 cmp $0x8,%edx
ffffffff81357943: 0f 82 8c 00 00 00 jb ffffffff813579d5 <copy_user_generic_unrolled+0x95>
ffffffff81357949: 89 f9 mov %edi,%ecx
ffffffff8135794b: 83 e1 07 and $0x7,%ecx
ffffffff8135794e: 74 15 je ffffffff81357965 <copy_user_generic_unrolled+0x25>
ffffffff81357950: 83 e9 08 sub $0x8,%ecx
ffffffff81357953: f7 d9 neg %ecx
ffffffff81357955: 29 ca sub %ecx,%edx
ffffffff81357957: 8a 06 mov (%rsi),%al
ffffffff81357959: 88 07 mov %al,(%rdi)
ffffffff8135795b: 48 ff c6 inc %rsi
ffffffff8135795e: 48 ff c7 inc %rdi
ffffffff81357961: ff c9 dec %ecx
ffffffff81357963: 75 f2 jne ffffffff81357957 <copy_user_generic_unrolled+0x17>
ffffffff81357965: 89 d1 mov %edx,%ecx
ffffffff81357967: 83 e2 3f and $0x3f,%edx
ffffffff8135796a: c1 e9 06 shr $0x6,%ecx
ffffffff8135796d: 74 4a je ffffffff813579b9 <copy_user_generic_unrolled+0x79>
ffffffff8135796f: 4c 8b 06 mov (%rsi),%r8
ffffffff81357972: 4c 8b 4e 08 mov 0x8(%rsi),%r9
ffffffff81357976: 4c 8b 56 10 mov 0x10(%rsi),%r10
ffffffff8135797a: 4c 8b 5e 18 mov 0x18(%rsi),%r11
ffffffff8135797e: 4c 89 07 mov %r8,(%rdi)
ffffffff81357981: 4c 89 4f 08 mov %r9,0x8(%rdi)
ffffffff81357985: 4c 89 57 10 mov %r10,0x10(%rdi)
ffffffff81357989: 4c 89 5f 18 mov %r11,0x18(%rdi)
ffffffff8135798d: 4c 8b 46 20 mov 0x20(%rsi),%r8
ffffffff81357991: 4c 8b 4e 28 mov 0x28(%rsi),%r9
ffffffff81357995: 4c 8b 56 30 mov 0x30(%rsi),%r10
ffffffff81357999: 4c 8b 5e 38 mov 0x38(%rsi),%r11
ffffffff8135799d: 4c 89 47 20 mov %r8,0x20(%rdi)
ffffffff813579a1: 4c 89 4f 28 mov %r9,0x28(%rdi)
ffffffff813579a5: 4c 89 57 30 mov %r10,0x30(%rdi)
ffffffff813579a9: 4c 89 5f 38 mov %r11,0x38(%rdi)
ffffffff813579ad: 48 8d 76 40 lea 0x40(%rsi),%rsi
ffffffff813579b1: 48 8d 7f 40 lea 0x40(%rdi),%rdi
ffffffff813579b5: ff c9 dec %ecx
ffffffff813579b7: 75 b6 jne ffffffff8135796f <copy_user_generic_unrolled+0x2f>
ffffffff813579b9: 89 d1 mov %edx,%ecx
ffffffff813579bb: 83 e2 07 and $0x7,%edx
ffffffff813579be: c1 e9 03 shr $0x3,%ecx
ffffffff813579c1: 74 12 je ffffffff813579d5 <copy_user_generic_unrolled+0x95>
ffffffff813579c3: 4c 8b 06 mov (%rsi),%r8
ffffffff813579c6: 4c 89 07 mov %r8,(%rdi)
ffffffff813579c9: 48 8d 76 08 lea 0x8(%rsi),%rsi
ffffffff813579cd: 48 8d 7f 08 lea 0x8(%rdi),%rdi
ffffffff813579d1: ff c9 dec %ecx
ffffffff813579d3: 75 ee jne ffffffff813579c3 <copy_user_generic_unrolled+0x83>
ffffffff813579d5: 21 d2 and %edx,%edx
ffffffff813579d7: 74 10 je ffffffff813579e9 <copy_user_generic_unrolled+0xa9>
ffffffff813579d9: 89 d1 mov %edx,%ecx
ffffffff813579db: 8a 06 mov (%rsi),%al
ffffffff813579dd: 88 07 mov %al,(%rdi)
ffffffff813579df: 48 ff c6 inc %rsi
ffffffff813579e2: 48 ff c7 inc %rdi
ffffffff813579e5: ff c9 dec %ecx
ffffffff813579e7: 75 f2 jne ffffffff813579db <copy_user_generic_unrolled+0x9b>
ffffffff813579e9: 31 c0 xor %eax,%eax
ffffffff813579eb: c3 retq
ffffffff813579ec: 0f 1f 40 00 nopl 0x0(%rax)

shows that for byte counts of less than 8 (the check done at address
ffffffff81357940), a byte-copy loop is used (starting at address
ffffffff813579d5).
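
In C terms, the kernel thus reads the 4-byte __data.__lock field like
this (a sketch of what the byte-copy tail at ffffffff813579d5
effectively does):

  /* four separate single-byte loads and stores instead of one
     32-bit load: not atomic with respect to a concurrent 32-bit
     write of the same word */
  unsigned char *d = dst;
  const unsigned char *s = src;
  unsigned int n = 4;

  while (n--)
          *d++ = *s++;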


This means that we have two problems here:
1) The __data.__lock field is written by glibc with an atomic 32-bit
compare-exchange, but read by the kernel with 4 separate byte
accesses. This violates the following statement on page 8-4 of the
Intel Software Developer's Manual V3
(http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html):

"Software should access semaphores (shared memory used for signalling
between multiple processors) using identical addresses and operand
lengths. For example, if one processor accesses a semaphore using a
word access, other processors should not access the semaphore using a
byte access."

2) The following race condition is possible:
- Thread x has locked the mutex guarding the memory allocator.
- Another thread y (possibly running on another CPU) attempts to lock
this mutex. Glibc first checks whether thread y has already locked
this mutex, which isn't the case.
- Thread y attempts to lock the mutex with the atomic 32-bit compare-
exchange (on line 288 of
http://osxr.org/glibc/source/nptl/pthread_mutex_lock.c?v=glibc-2.13).
This compare-exchange fails because thread x still holds the mutex.
- Thread y now starts the futex system call. The kernel starts reading
the __data.__lock field of the mutex with 4 separate byte reads.
- After thread y has read the first byte and before it has read the
second byte, thread x releases the mutex with an atomic 32-bit
compare-exchange (see line 217 of
http://osxr.org/glibc/source/nptl/pthread_mutex_unlock.c?v=glibc-2.13).
- Thread y now continues reading the __data.__lock field and ends up
with an inconsistent __data.__lock value: the least significant byte
corresponds to the __data.__lock value before thread x released the
mutex, while the other bytes correspond to the __data.__lock value
after thread x released it.
- This inconsistent value can result in an incorrect decision to
return -EDEADLK, and possibly in other problems.
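
The effect is easy to demonstrate in userspace. The following
standalone program (an illustrative sketch, not the kernel code
itself) has one thread update a 32-bit word atomically while another
reads it byte by byte, just as copy_user_generic_unrolled() does for
counts below 8; any observed value other than 0x00000000 or 0xffffffff
is a torn read:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static volatile uint32_t lockword;

static void *writer(void *arg)
{
        (void)arg;
        for (;;) {
                /* atomic 32-bit stores, like the glibc compare-exchange */
                __atomic_store_n(&lockword, 0x00000000u, __ATOMIC_SEQ_CST);
                __atomic_store_n(&lockword, 0xffffffffu, __ATOMIC_SEQ_CST);
        }
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, writer, NULL);
        for (;;) {
                const volatile uint8_t *p = (const volatile uint8_t *)&lockword;
                uint32_t v = 0;
                int i;

                /* read the word as 4 separate bytes, like the kernel did */
                for (i = 0; i < 4; i++)
                        v |= (uint32_t)p[i] << (8 * i);
                if (v != 0x00000000u && v != 0xffffffffu) {
                        printf("torn read: 0x%08x\n", v);
                        return 1;
                }
        }
}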

This problem is caused by the fact that, for some reason, the compiler
decided to call copy_user_generic_unrolled() to read a 32-bit value
instead of emitting a single 32-bit read instruction, which is atomic
on x86 for naturally aligned data. The kernel's
__copy_from_user_inatomic() code is written in such a way that the
compiler can figure out it should use a single 32-bit read, but in
this case it fails to do so.
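
For comparison, the constant-size dispatch that should make this
possible looks roughly like the following (paraphrased from
arch/x86/include/asm/uaccess_64.h; exact details vary between kernel
versions):

  if (!__builtin_constant_p(size))
          return copy_user_generic(dst, (__force void *)src, size);
  switch (size) {
  case 1:
          __get_user_asm(*(u8 *)dst, (u8 __user *)src,
                         ret, "b", "b", "=q", 1);
          return ret;
  case 4:
          /* a single 32-bit mov, atomic on x86 */
          __get_user_asm(*(u32 *)dst, (u32 __user *)src,
                         ret, "l", "k", "=r", 4);
          return ret;
  /* ... cases 2, 8, etc. elided ... */
  default:
          return copy_user_generic(dst, (__force void *)src, size);
  }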

We did some tests with different compilers, kernel versions and kernel
configs, with the following results:

Linux 3.12.48, x86_64_defconfig, GCC 4.6.1:
  copy_user_generic_unrolled() is used, so the race condition is possible
Linux 3.12.48, x86_64_defconfig, GCC 4.9.1:
  copy_user_generic_unrolled() is used, so the race condition is possible
Linux 4.2.3, x86_64_defconfig, GCC 4.6.1:
  a 32-bit read is used, no race condition
Linux 4.2.3, x86_64_defconfig, GCC 4.9.1:
  a 32-bit read is used, no race condition


Our idea for fixing this problem is to use an explicit 32-bit read in
get_futex_value_locked() instead of calling the generic
__copy_from_user_inatomic() and hoping that the compiler emits an
atomic access of the right size.
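
A minimal sketch of what we have in mind (assuming __get_user() on a
u32 pointer expands to a single 4-byte mov via its size dispatch on
x86-64):

static int get_futex_value_locked(u32 *dest, u32 __user *from)
{
        int ret;

        pagefault_disable();
        /* one 32-bit load, atomic on x86 for naturally aligned data */
        ret = __get_user(*dest, from);
        pagefault_enable();

        return ret ? -EFAULT : 0;
}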

What is your opinion on this?

Regards,
Jaccon Bastiaansen