[patch 00/40] CPU hotplug rework - episode I

From: Thomas Gleixner
Date: Thu Jan 31 2013 - 10:44:20 EST

The current CPU hotplug implementation has become an increasing
nightmare full of races and undocumented behaviour. The main issue of
the current hotplug scheme is the completely asymmetric
startup/teardown process. The hotplug notifiers are mostly
undocumented and the CPU_* actions in lots of implementations seem to
be randomly chosen.

We had a long discussion in San Diego last year about reworking the
hotplug core into a fully symmetric state machine. After a few doomed
attempts to convert the existing code into a state machine, I finally
found a workable solution.

The following patch series implements a trivial array based state
machine which replaces the existing steps in cpu_up/down. The
notifiers which must run on the hotplugged cpu are converted to a
callback array as well. This clearly documents the ordering of the
callbacks and also makes the asymmetric behaviour very obvious.
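
As an illustration only, here is a minimal userspace sketch of what an
array based state machine of this kind could look like. It is not the
code from the series: the state ids (HP_*), struct hp_step, hp_walk()
and the trace buffer are hypothetical names made up for this example.
Walking up calls the startup callbacks in array order, walking down
calls the teardown callbacks in reverse order, and a missing teardown
entry shows up directly in the table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state ids for illustration; not the ones from the series. */
enum { HP_OFFLINE, HP_PREP_A, HP_PREP_B, HP_ONLINE, HP_MAX };

struct hp_step {
	const char *name;
	int (*startup)(unsigned int cpu);	/* NULL: nothing to do going up */
	int (*teardown)(unsigned int cpu);	/* NULL: nothing to do going down */
};

/* Records the callback order so a walk can be inspected afterwards. */
static char hp_trace[64];

static void hp_note(const char *s)
{
	strncat(hp_trace, s, sizeof(hp_trace) - strlen(hp_trace) - 1);
}

static int prep_a_up(unsigned int cpu)   { (void)cpu; hp_note("A+"); return 0; }
static int prep_a_down(unsigned int cpu) { (void)cpu; hp_note("A-"); return 0; }
static int prep_b_up(unsigned int cpu)   { (void)cpu; hp_note("B+"); return 0; }

/* The array index is the state, so the ordering is the documentation. */
static const struct hp_step hp_steps[HP_MAX] = {
	[HP_OFFLINE] = { "offline", NULL,      NULL },
	[HP_PREP_A]  = { "prep-a",  prep_a_up, prep_a_down },
	[HP_PREP_B]  = { "prep-b",  prep_b_up, NULL },	/* asymmetric: no teardown */
	[HP_ONLINE]  = { "online",  NULL,      NULL },
};

/* Move one cpu from *state to target, one array slot at a time. */
static int hp_walk(unsigned int cpu, int *state, int target)
{
	assert(target >= HP_OFFLINE && target < HP_MAX);
	assert(*state >= HP_OFFLINE && *state < HP_MAX);

	while (*state < target) {
		const struct hp_step *step = &hp_steps[*state + 1];
		int ret = step->startup ? step->startup(cpu) : 0;

		if (ret)
			return ret;	/* this sketch leaves rollback to the caller */
		(*state)++;
	}
	while (*state > target) {
		const struct hp_step *step = &hp_steps[*state];

		if (step->teardown)
			step->teardown(cpu);
		(*state)--;
	}
	return 0;
}
```

Walking a cpu from HP_OFFLINE to HP_ONLINE and back with this table
runs prep_a_up, prep_b_up and then only prep_a_down, which is the kind
of asymmetry the table makes visible.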

This series converts the stop_machine thread to the smpboot
infrastructure, implements the core state machine and converts all
notifiers which have ordering constraints plus a randomly chosen bunch
of other notifiers to the state machine.

The runtime installed callbacks are immediately executed by the core
code on or on behalf of all cpus which have already reached the
corresponding state. A non executing installer function is there as
well to allow simple migration of the existing notifier maze.
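
The difference between the executing and the non executing installer
can be sketched roughly as follows. This is a userspace illustration
under stated assumptions, not the interface from the series:
demo_install(), demo_install_noexec(), the per cpu state array and the
fixed sizes are all hypothetical. The sketch also does not undo
callbacks which already ran when a later cpu fails; a real
implementation would have to deal with that.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS	4
#define DEMO_NR_STATES	8

typedef int (*hp_cb_t)(unsigned int cpu);

/* One startup slot per state, plus the state each cpu has reached. */
static hp_cb_t demo_startup_cb[DEMO_NR_STATES];
static int demo_cpu_state[DEMO_NR_CPUS];

/* Install only: meant for code which handles already online cpus itself. */
static int demo_install_noexec(int state, hp_cb_t startup)
{
	if (!startup || state < 0 || state >= DEMO_NR_STATES)
		return -1;
	if (demo_startup_cb[state])
		return -1;	/* slot already taken */
	demo_startup_cb[state] = startup;
	return 0;
}

/* Install and run the callback for every cpu already at or past 'state'. */
static int demo_install(int state, hp_cb_t startup)
{
	unsigned int cpu;
	int ret = demo_install_noexec(state, startup);

	if (ret)
		return ret;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		if (demo_cpu_state[cpu] < state)
			continue;	/* will run later when the cpu gets there */
		ret = startup(cpu);
		if (ret) {
			demo_startup_cb[state] = NULL;	/* no undo in this sketch */
			return ret;
		}
	}
	return 0;
}

/* Example callback: remember which cpus it was invoked for. */
static unsigned int demo_mask;

static int demo_startup(unsigned int cpu)
{
	assert(cpu < DEMO_NR_CPUS);
	demo_mask |= 1u << cpu;
	return 0;
}
```

With cpus 0 and 2 at state 3 and cpus 1 and 3 below state 2, installing
demo_startup at state 2 via demo_install() invokes it for cpus 0 and 2
only, while demo_install_noexec() never invokes it.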

The diffstat of the complete series is appended below.

36 files changed, 1300 insertions(+), 1179 deletions(-)

We add slightly more code at this stage (225 lines alone in a header
file), but most of the conversions are removing code and we have only
tackled about 30 of 130+ instances. Even with the current conversion
state, the resulting text size shrinks already.

Known issues:
The current series has an unsolved section mismatch issue with the
array callbacks which are already installed at compile time.

There is more work in the pipeline:

- Convert all notifiers to the state machine callbacks

- Analyze the asymmetric callbacks and fix them if possible or at
  least document why they need to be asymmetric.

- Unify the low level bringup across the architectures
  (e.g. synchronization between boot and hotplugged cpus, common
  setups, scheduler exposure, etc.)

At the end hotplug should run through an array of callbacks on both
sides with explicit core synchronization points. The ordering should
look like this:

    CPUHP_OFFLINE               // Start state.
    CPUHP_PREP_<hardware>       // Kick CPU into life / let it die
    CPUHP_PREP_<datastructures> // Get datastructures set up / freed.
    CPUHP_PREP_<threads>        // Create threads for cpu
    CPUHP_SYNC                  // Synchronization point
    CPUHP_INIT_<hardware>       // Startup/teardown on the CPU (interrupts, timers ...)
    CPUHP_SCHED_<stuff_on_CPU>  // Unpark/park per cpu local threads on the CPU.
    CPUHP_ENABLE_<stuff_on_CPU> // Enable/disable facilities
    CPUHP_SYNC                  // Synchronization point
    CPUHP_SCHED                 // Expose/remove CPU from general scheduler.
    CPUHP_ONLINE                // Final state

All PREP states can fail and the corresponding teardown callbacks are
invoked in the same way as they are invoked on offlining.
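
A rough sketch of that failure handling, again as a userspace
illustration and not the code from the series: struct prep_step,
prep_bringup(), prep_teardown() and the three example steps are
hypothetical, and the failing step returning -12 is just an arbitrary
error value picked for the demo. The point is that the rollback after
a failed PREP step and the regular offline path share one teardown
walk.

```c
#include <assert.h>
#include <stddef.h>

struct prep_step {
	int (*startup)(unsigned int cpu);
	void (*teardown)(unsigned int cpu);
};

/* Event log: positive = startup ran, negative = teardown ran. */
static int rb_events[16];
static int rb_nr;

static void rb_log(int ev)
{
	if (rb_nr < 16)
		rb_events[rb_nr++] = ev;
}

static int hw_up(unsigned int cpu)        { (void)cpu; rb_log(1); return 0; }
static void hw_down(unsigned int cpu)     { (void)cpu; rb_log(-1); }
static int data_up(unsigned int cpu)      { (void)cpu; rb_log(2); return 0; }
static void data_down(unsigned int cpu)   { (void)cpu; rb_log(-2); }
/* Simulated failure, e.g. thread creation running out of memory. */
static int threads_up(unsigned int cpu)   { (void)cpu; return -12; }
static void threads_down(unsigned int cpu) { (void)cpu; rb_log(-3); }

static const struct prep_step prep_steps[] = {
	{ hw_up,      hw_down },
	{ data_up,    data_down },
	{ threads_up, threads_down },
};
#define PREP_NR ((int)(sizeof(prep_steps) / sizeof(prep_steps[0])))

/*
 * Tear down steps [0, done) in reverse order. Offlining would call this
 * with done == PREP_NR, a failed bringup with the number of steps which
 * completed.
 */
static void prep_teardown(unsigned int cpu, int done)
{
	assert(done >= 0 && done <= PREP_NR);

	while (done-- > 0)
		if (prep_steps[done].teardown)
			prep_steps[done].teardown(cpu);
}

static int prep_bringup(unsigned int cpu)
{
	int i, ret;

	for (i = 0; i < PREP_NR; i++) {
		ret = prep_steps[i].startup ? prep_steps[i].startup(cpu) : 0;
		if (ret) {
			/* Step i failed, so only steps 0..i-1 need undoing. */
			prep_teardown(cpu, i);
			return ret;
		}
	}
	return 0;
}
```

Here prep_bringup() runs hw_up and data_up, fails in threads_up, and
then runs data_down followed by hw_down before returning the error.
The teardown of the failed step itself is not invoked.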

The existing DOWN_PREPARE notifier has only two instances which
actually might prevent the CPU from going down: rcu_tree and
padata. We might need to keep them, but these can be explicitly
documented asymmetric states.

Quite a few of the ONLINE/DOWN_PREPARE notifiers are racy and need a
proper inspection. All other valid users of ONLINE/DOWN_PREPARE
notifiers should be put into the CPUHP_ENABLE state block and be
executed on the hotplugged CPU. I have not seen a single instance
(except the scheduler) which needs to be executed before we remove the
CPU from the general scheduler itself.

This final design needs quite a bit of massaging of the current scheduler
code, but last time I discussed this with scheduler folks it seemed to
be doable with a reasonable effort. Other than that I don't see any
(un)real showstoppers on the horizon.


 arch/arm/kernel/perf_event_cpu.c              |  28 -
 arch/arm/vfp/vfpmodule.c                      |  29 -
 arch/blackfin/kernel/perf_event.c             |  25 -
 arch/powerpc/perf/core-book3s.c               |  29 -
 arch/s390/kernel/perf_cpum_cf.c               |  37 -
 arch/s390/kernel/vtime.c                      |  18
 arch/sh/kernel/perf_event.c                   |  22
 arch/x86/kernel/apic/x2apic_cluster.c         |  80 +--
 arch/x86/kernel/cpu/perf_event.c              |  78 +--
 arch/x86/kernel/cpu/perf_event_amd.c          |   6
 arch/x86/kernel/cpu/perf_event_amd_ibs.c      |  54 --
 arch/x86/kernel/cpu/perf_event_intel.c        |   6
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 109 +---
 arch/x86/kernel/tboot.c                       |  23
 drivers/clocksource/arm_generic.c             |  40 -
 drivers/cpufreq/cpufreq_stats.c               |  55 --
 include/linux/cpu.h                           |  45 -
 include/linux/cpuhotplug.h                    | 207 ++++++++
 include/linux/perf_event.h                    |  21
 include/linux/smpboot.h                       |   5
 init/main.c                                   |  15
 kernel/cpu.c                                  | 613 ++++++++++++++++++++++----
 kernel/events/core.c                          |  36 -
 kernel/hrtimer.c                              |  47 -
 kernel/profile.c                              |  92 +--
 kernel/rcutree.c                              |  95 +---
 kernel/sched/core.c                           | 251 ++++------
 kernel/sched/fair.c                           |  16
 kernel/smp.c                                  |  50 --
 kernel/smpboot.c                              |  11
 kernel/smpboot.h                              |   4
 kernel/stop_machine.c                         | 154 ++----
 kernel/time/clockevents.c                     |  13
 kernel/timer.c                                |  43 -
 kernel/workqueue.c                            |  80 +--
 virt/kvm/kvm_main.c                           |  42 -
36 files changed, 1300 insertions(+), 1179 deletions(-)
