[BUG] Possible locking issues in stop_machine code on 6k core machine

From: Alex Thorlton
Date: Wed Dec 03 2014 - 15:40:51 EST


Hey guys,

While working to get our newly upgraded 6k core machine online, we've
discovered a few possible locking issues in the stop_machine code that
we're trying to get sorted out. We think the problems we're seeing stem
from an interaction between stop_cpus and stop_one_cpu. The issue
presents as a deadlock, and only seems to show itself intermittently.

After quite a bit of debugging, we think we've narrowed the issue down
to the fact that stop_one_cpu does not respect most of the locks that
are taken in the stop_cpus code path. For reference, the stop_cpus code
path takes the stop_cpus_mutex, then the stop_cpus_lock, and then each
cpu's stopper->lock. stop_one_cpu seems to rely solely on the
stopper->lock.
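
To make the lock coverage above concrete, here is a rough userspace
sketch. It is not kernel code: the model_* names and the enum are made
up for illustration, and the lock sets simply restate our reading of the
two code paths as described in this mail.

```c
#include <stdbool.h>

/* Hypothetical labels for the three locks named in this mail. */
enum model_lock {
    STOP_CPUS_MUTEX,
    STOP_CPUS_LOCK,
    STOPPER_LOCK,
    NR_MODEL_LOCKS
};

struct model_lock_set {
    bool taken[NR_MODEL_LOCKS];
};

/* Locks the stop_cpus path takes, per the description in this mail. */
static struct model_lock_set model_stop_cpus_locks(void)
{
    struct model_lock_set s = { { false } };

    s.taken[STOP_CPUS_MUTEX] = true;
    s.taken[STOP_CPUS_LOCK] = true;
    s.taken[STOPPER_LOCK] = true;
    return s;
}

/* Locks the stop_one_cpu path takes, per the description in this mail. */
static struct model_lock_set model_stop_one_cpu_locks(void)
{
    struct model_lock_set s = { { false } };

    s.taken[STOPPER_LOCK] = true;
    return s;
}

/* Count the locks that both paths take. */
static int model_shared_locks(struct model_lock_set a,
                              struct model_lock_set b)
{
    int i, n = 0;

    for (i = 0; i < NR_MODEL_LOCKS; i++)
        if (a.taken[i] && b.taken[i])
            n++;
    return n;
}
```

In this model the two paths share exactly one lock, the per-cpu
stopper->lock, so nothing that covers the whole stop_cpus operation
would keep a concurrent stop_one_cpu out.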

What appears to cause our deadlock is the following: stop_cpus works
its way down to queue_stop_cpus_work, which tells each cpu's stopper
task to wake up, take its lock, and do its work. As the loop that does
this progresses, the lowest numbered cpus complete their work, and are
allowed to go on about their business. The problem occurs when one of
these lower numbered cpus calls stop_one_cpu, targeting one of the
higher numbered cpus, which the stop_cpus loop has not yet reached. If
this happens, that higher numbered cpu's completion variable will get
stomped on, and the wait_for_completion in the stop_cpus code path will
never return.

A quick example: CPU 0 calls stop_cpus, which will hit all 6,000 cores.
CPU 50 completes its stopper work, and at some point in the near future
calls stop_one_cpu on CPU 5000. This clobbers CPU 5000's pointer to the
cpu_stop_done struct set up in queue_stop_cpus_work, meaning that, once
CPU 5000 completes its work, it won't be able to decrement the nr_todo
for the correct cpu_stop_done struct, and CPU 0's wait_for_completion
will never return.
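
The example above can be sketched as a small single-threaded userspace
model. This is only an illustration of our hypothesis, not the kernel's
actual data structures: the model_* names, the single "done" slot per
cpu, and the ordering of events are all assumptions, and NR_CPUS is
shrunk to 8 to stand in for the 6,000 cores.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 8 /* small stand-in for the 6,000 cores */

/* Hypothetical stand-in for cpu_stop_done: the waiter is woken
 * only once nr_todo drops to zero. */
struct model_done {
    int nr_todo;
    int completed;
};

/* Assumption of this model: each cpu has one slot for the pointer
 * to the done struct it should report to. */
struct model_stopper {
    struct model_done *done;
};

static struct model_stopper stoppers[NR_CPUS];

/* Point a cpu at a done struct, overwriting whatever was there. */
static void model_queue(int cpu, struct model_done *done)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    stoppers[cpu].done = done;
}

/* The cpu finishes its stopper work and reports to whichever done
 * struct its slot currently points at. */
static void model_run(int cpu)
{
    struct model_done *done;

    assert(cpu >= 0 && cpu < NR_CPUS);
    done = stoppers[cpu].done;
    if (done && --done->nr_todo == 0)
        done->completed = 1;
    stoppers[cpu].done = NULL;
}

/* Returns 1 if the stop_cpus waiter would be woken, 0 if it would
 * wait forever. */
static int model_scenario(int inject_stop_one_cpu)
{
    struct model_done all = { NR_CPUS, 0 }; /* for the stop_cpus call */
    struct model_done one = { 1, 0 };       /* for the stop_one_cpu call */
    const int victim = NR_CPUS - 1;
    int cpu;

    /* stop_cpus points every cpu at the shared done struct. */
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        model_queue(cpu, &all);

    /* cpus finish in ascending order; a low numbered cpu may then
     * target the victim before the victim has done its work. */
    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        model_run(cpu);
        if (inject_stop_one_cpu && cpu == 0)
            model_queue(victim, &one);
    }
    return all.completed;
}
```

In this model, model_scenario(0) returns 1, since every cpu decrements
the shared nr_todo. model_scenario(1) returns 0: the victim decrements
the wrong struct's nr_todo, so the shared count is left at 1 and the
stop_cpus waiter is never woken.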

Again, much of this is semi-educated guesswork, put together based on
information gathered from examining lots of debug output, in an attempt
to spot the problem. We're fairly certain that we've pinned down our
issue, but we'd like to ask those who are more knowledgeable about
these code paths to weigh in here.

We'd really appreciate any help that anyone can offer. Thanks!

- Alex