[PATCH v7 0/2] CPU hotplug: Fix the long-standing "IPI to offline CPU" issue

From: Srivatsa S. Bhat
Date: Mon May 26 2014 - 07:09:36 EST



Hi,

There is a long-standing problem related to CPU hotplug which causes IPIs to
be delivered to offline CPUs, and the smp-call-function IPI handler code
prints out a warning whenever this is detected. Every once in a while this
(usually harmless) warning gets reported on LKML, but so far it has not been
completely fixed. Usually the solution involves finding out the IPI sender
and fixing it by adding appropriate synchronization with CPU hotplug.

However, while going through one such internal bug report, I found that
there is a significant bug on the receiver side itself (more specifically,
in stop-machine) that can lead to this problem even when the sender code
is perfectly fine. This patchset handles that scenario to ensure that a
CPU doesn't go offline with callbacks still pending.

Patch 1 adds some additional debug code to the smp-call-function framework,
to help debug such issues easily.
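
To give an idea of what Patch 1 does (this is only a rough sketch of the
idea, not the actual diff): when the IPI handler finds itself running on an
offline CPU, it walks the list of callbacks it has just dequeued and prints
their function pointers, so that the offending sender can be identified
from the log.

        /*
         * Sketch only: inside generic_smp_call_function_single_interrupt(),
         * after the pending callbacks have been pulled off the per-cpu
         * call_single_queue into 'entry'.
         */
        if (unlikely(!cpu_online(smp_processor_id()))) {
                struct call_single_data *csd;

                WARN(1, "IPI received on offline CPU %d\n",
                     smp_processor_id());

                /* Print who queued work on us, to help spot the sender. */
                llist_for_each_entry(csd, entry, llist)
                        pr_warn("IPI callback %pS sent to offline CPU\n",
                                csd->func);
        }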

Patch 2 adds a mechanism to flush any pending smp-call-function callbacks
queued on the CPU going offline (including callbacks whose IPIs from the
source CPUs might not have reached the outgoing CPU in time). This ensures
that a CPU never goes offline with work still pending. Also, the warning
condition in the smp-call-function IPI handler code is modified to trigger
only if an IPI is received on an offline CPU *and* it still has pending
callbacks to execute, since that's the only remaining buggy scenario after
applying this patch.
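
Roughly, the approach looks like the sketch below (not the literal patch;
the helper shown is only illustrative of what the real code does): factor
the callback-processing loop out of the IPI handler into a common flush
routine, call that routine from the CPU_DYING notifier (which runs on the
dying CPU with interrupts disabled), and warn only when an offline CPU
finds callbacks actually pending.

        /*
         * Sketch: a common flush routine shared by the IPI handler and
         * the CPU hotplug (CPU_DYING) path.
         */
        static void flush_smp_call_function_queue(bool warn_cpu_offline)
        {
                struct llist_head *head = this_cpu_ptr(&call_single_queue);
                struct llist_node *entry = llist_del_all(head);
                struct call_single_data *csd, *csd_next;

                /* Run the callbacks in the order they were queued. */
                entry = llist_reverse_order(entry);

                /* Warn only if we are offline *and* work was pending. */
                if (warn_cpu_offline && !cpu_online(smp_processor_id()) &&
                    entry)
                        WARN(1, "IPI on offline CPU %d with pending callbacks\n",
                             smp_processor_id());

                llist_for_each_entry_safe(csd, csd_next, entry, llist) {
                        csd->func(csd->info);
                        csd_unlock(csd);
                }
        }

        /*
         * The IPI handler then becomes a thin wrapper around the flush
         * routine, and the CPU_DYING notifier calls
         * flush_smp_call_function_queue(false) on the outgoing CPU, so
         * that it never goes offline with callbacks still queued.
         */
        void generic_smp_call_function_single_interrupt(void)
        {
                flush_smp_call_function_queue(true);
        }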


In fact, I debugged the problem by using Patch 1, and found that the
payload of the IPI was always the block layer's trigger_softirq() function.
But I was not able to find anything wrong with the block layer code. That's
when I started looking at the stop-machine code and realized that there is
a race window which makes the IPI _receiver_ the culprit, not the sender:
during the stop-machine phase of CPU offline, nothing guarantees that the
outgoing CPU disables its interrupts last, so a callback queued (and an IPI
sent) just before the sender disables its own interrupts can be left
pending on the outgoing CPU, which then goes offline without ever servicing
it. Patch 2 handles this scenario and hence should put an end to most of
the hard-to-debug IPI-to-offline-CPU issues.



Changes in v7:
* Modified the warning condition in smp-call-function IPI handler code, such
that it triggers only if an offline CPU got an IPI *and* it still had pending
callbacks to execute.
* Completely dropped the patch that modified the stop-machine code to
introduce additional states to order the disabling of interrupts on various
CPUs. This strict ordering is not necessary any more after the first change.
Thanks to Frederic Weisbecker for suggesting this enhancement.

Changes in v6:
Modified Patch 3 to flush the pending callbacks from the CPU_DYING notifier
instead of from stop-machine directly, so that only the CPU hotplug path
runs this code, rather than every user of stop-machine. Suggested by
Peter Zijlstra.

Changes in v5:
Added Patch 3 to flush out any pending smp-call-function callbacks on the
outgoing CPU, as suggested by Frederic Weisbecker.

Changes in v4:
Rewrote a comment in Patch 2 and reorganized the code for better readability.

Changes in v3:
Rewrote patch 2 and split the MULTI_STOP_DISABLE_IRQ state into two:
MULTI_STOP_DISABLE_IRQ_INACTIVE and MULTI_STOP_DISABLE_IRQ_ACTIVE, and
used this framework to ensure that the CPU going offline always disables
its interrupts last. Suggested by Tejun Heo.

v1 and v2:
https://lkml.org/lkml/2014/5/6/474


Srivatsa S. Bhat (2):
smp: Print more useful debug info upon receiving IPI on an offline CPU
CPU hotplug, smp: Flush any pending IPI callbacks before CPU offline


kernel/smp.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 60 insertions(+), 8 deletions(-)


Regards,
Srivatsa S. Bhat
IBM Linux Technology Center
