Re: Question on mutex code

From: Matthias Bonne
Date: Sat Mar 14 2015 - 19:06:00 EST

Next message: Matthias Bonne: "Re: Question on mutex code"
Previous message: Andy Lutomirski: "Re: [RFC] capabilities: Ambient capabilities"
In reply to: Matthias Bonne: "Re: Question on mutex code"
Next in thread: Davidlohr Bueso: "Re: Question on mutex code"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 03/10/15 15:03, Yann Droneaud wrote:

Hi,

Le mercredi 04 mars 2015 Ã 02:13 +0200, Matthias Bonne a Ãcrit :

I am trying to understand how mutexes work in the kernel, and I think
there might be a race between mutex_trylock() and mutex_unlock(). More
specifically, the race is between the functions
__mutex_trylock_slowpath and __mutex_unlock_common_slowpath (both
defined in kernel/locking/mutex.c).

Consider the following sequence of events:

[...]

The end result is that the mutex count is 0 (locked), although the
owner has just released it, and nobody else is holding the mutex. So it
can no longer be acquired by anyone.

Am I missing something that prevents the above scenario from happening?
If not, should I post a patch that fixes it to LKML? Or is it
considered too "theoretical" and cannot happen in practice?

I haven't looked at your explanations, you should have come with a
reproductible test case to demonstrate the issue (involving slowing
down one CPU ?).

Anyway, such deep knowledge on the mutex implementation has to be found
on lkml.

Regards.

Thank you for your suggestions, and sorry for the long delay.

I see now that my explanation was unneccesarily complex. The problem is
this code from __mutex_trylock_slowpath():

spin_lock_mutex(&lock->wait_lock, flags);

prev = atomic_xchg(&lock->count, -1);
if (likely(prev == 1)) {
mutex_set_owner(lock);
mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
}

/* Set it back to 0 if there are no waiters: */
if (likely(list_empty(&lock->wait_list)))
atomic_set(&lock->count, 0);

spin_unlock_mutex(&lock->wait_lock, flags);

return prev == 1;

The above code assumes that the mutex cannot be unlocked while the
spinlock is held. However, mutex_unlock() sets the mutex count to 1
before taking the spinlock (even in the slowpath). If this happens
between the atomic_xchg() and the atomic_set() above, and the mutex has
no waiters, then the atomic_set() will set the mutex count back to 0
after it has been unlocked by mutex_unlock(), but mutex_trylock() will
still return failure. So the mutex will remain locked forever.

I don't know how to write a test case to demonstrate the issue, because
this race is very hard to trigger in practice: the mutex needs to be
locked immediately before the spinlock is acquired, and unlocked in the
very short interval between atomic_xchg() and atomic_set(). It also
requires that CONFIG_DEBUG_MUTEXES be set, since AFAICT the mutex
debugging code is currently the only user of __mutex_trylock_slowpath.
This is why I asked if it is acceptable to submit a patch for such
hard-to-trigger problems.

I think I will just send a fix. Any further suggestions or guidance
would be appreciated.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Matthias Bonne: "Re: Question on mutex code"
Previous message: Andy Lutomirski: "Re: [RFC] capabilities: Ambient capabilities"
In reply to: Matthias Bonne: "Re: Question on mutex code"
Next in thread: Davidlohr Bueso: "Re: Question on mutex code"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]