Re: [RFC PATCH 01/10] CPU hotplug: Introduce "stable" cpu onlinemask, for atomic hotplug readers

From: Michael Wang
Date: Tue Dec 04 2012 - 21:56:14 EST


On 12/05/2012 06:10 AM, Andrew Morton wrote:
> On Tue, 04 Dec 2012 14:23:41 +0530
> "Srivatsa S. Bhat" <srivatsa.bhat@xxxxxxxxxxxxxxxxxx> wrote:
>
>> From: Michael Wang <wangyun@xxxxxxxxxxxxxxxxxx>
>>
>> There are places where preempt_disable() is used to prevent any CPU from
>> going offline during the critical section. Let us call them as "atomic
>> hotplug readers" (atomic because they run in atomic contexts).
>>
>> Often, these atomic hotplug readers have a simple need : they want the cpu
>> online mask that they work with (inside their critical section), to be
>> stable, i.e., it should be guaranteed that CPUs in that mask won't go
>> offline during the critical section. An important point here is that they
>> don't really care if such a "stable" mask is a subset of the actual
>> cpu_online_mask.
>>
>> The intent of this patch is to provide such a "stable" cpu online mask
>> for that class of atomic hotplug readers.
>>
>> Fundamental idea behind the design:
>> -----------------------------------
>>
>> Simply put, have a separate mask called the stable cpu online mask; and
>> at the hotplug writer (cpu_down()), note down the CPU that is about to go
>> offline, and remove it from the stable cpu online mask. Then, feel free
>> to take that CPU offline, since the atomic hotplug readers won't see it
>> from now on. Also, don't start any new cpu_down() operations until all
>> existing atomic hotplug readers have completed (because they might still
>> be working with the old value of the stable online mask).
>>
>> Some important design requirements and considerations:
>> -----------------------------------------------------
>>
>> 1. The atomic hotplug readers should ideally *never* wait for the hotplug
>> writer (cpu_down()) for *anything*. Because, these atomic hotplug readers
>> can be in very hot-paths like interrupt handling/IPI and hence, if they
>> have to wait for an ongoing cpu_down() to complete, it would pretty much
>> introduce the same performance/latency problems as stop_machine().
>>
>> 2. Any synchronization at the atomic hotplug readers side must be highly
>> scalable - avoid global locks/counters etc. Because, these paths currently
>> use the extremely fast preempt_disable(); our replacement to
>> preempt_disable() should not become ridiculously costly.
>>
>> 3. preempt_disable() was recursive. The replacement should also be recursive.
>>
>> Implementation of the design:
>> ----------------------------
>>
>> Atomic hotplug reader side:
>>
>> We use per-cpu counters to mark the presence of atomic hotplug readers.
>> A reader would increment its per-cpu counter and continue, without waiting
>> for anything. And no locks are used in this path. Together, these satisfy
>> all the 3 requirements mentioned above.
>>
>> The hotplug writer uses (reads) the per-cpu counters of all CPUs in order to
>> ensure that all existing atomic hotplug readers have completed. Only after
>> that, it will proceed to actually take the CPU offline.
>>
>> [ Michael: Designed the synchronization for the IPI case ]
>
> Like this:
>
> [wangyun@xxxxxxxxxxxxxxxxxx: designed the synchronization for the IPI case]
>
>> Signed-off-by: Michael Wang <wangyun@xxxxxxxxxxxxxxxxxx>
>> [ Srivatsa: Generalized it to work for all cases and wrote the changelog ]
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
>>
>> ...
>>
>> --- a/include/linux/cpu.h
>> +++ b/include/linux/cpu.h
>> @@ -175,6 +175,8 @@ extern struct bus_type cpu_subsys;
>>
>> extern void get_online_cpus(void);
>> extern void put_online_cpus(void);
>> +extern void get_online_cpus_stable_atomic(void);
>> +extern void put_online_cpus_stable_atomic(void);
>> #define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
>> #define register_hotcpu_notifier(nb) register_cpu_notifier(nb)
>> #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb)
>> @@ -198,6 +200,8 @@ static inline void cpu_hotplug_driver_unlock(void)
>>
>> #define get_online_cpus() do { } while (0)
>> #define put_online_cpus() do { } while (0)
>> +#define get_online_cpus_stable_atomic() do { } while (0)
>> +#define put_online_cpus_stable_atomic() do { } while (0)
>
> static inline C functions would be preferred if possible. Feel free to
> fix up the wrong crufty surrounding code as well ;)
>
>>
>> ...
>>
>> @@ -705,12 +707,15 @@ extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
>>
>> #define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)
>> #define for_each_online_cpu(cpu) for_each_cpu((cpu), cpu_online_mask)
>> +#define for_each_online_cpu_stable(cpu) \
>> + for_each_cpu((cpu), cpu_online_stable_mask)
>> #define for_each_present_cpu(cpu) for_each_cpu((cpu), cpu_present_mask)
>>
>> /* Wrappers for arch boot code to manipulate normally-constant masks */
>> void set_cpu_possible(unsigned int cpu, bool possible);
>> void set_cpu_present(unsigned int cpu, bool present);
>> void set_cpu_online(unsigned int cpu, bool online);
>> +void set_cpu_online_stable(unsigned int cpu, bool online);
>
> The naming is inconsistent. This is "cpu_online_stable", but
> for_each_online_cpu_stable() is "online_cpu_stable". Can we make
> everything the same?
>
>> void set_cpu_active(unsigned int cpu, bool active);
>> void init_cpu_present(const struct cpumask *src);
>> void init_cpu_possible(const struct cpumask *src);
>>
>> ...
>>
>> +void get_online_cpus_stable_atomic(void)
>> +{
>> + preempt_disable();
>> + this_cpu_inc(hotplug_reader_refcount);
>> +
>> + /*
>> + * Prevent reordering of writes to hotplug_reader_refcount and
>> + * reads from cpu_online_stable_mask.
>> + */
>> + smp_mb();
>> +}
>> +EXPORT_SYMBOL_GPL(get_online_cpus_stable_atomic);
>> +
>> +void put_online_cpus_stable_atomic(void)
>> +{
>> + /*
>> + * Prevent reordering of reads from cpu_online_stable_mask and
>> + * writes to hotplug_reader_refcount.
>> + */
>> + smp_mb();
>> + this_cpu_dec(hotplug_reader_refcount);
>> + preempt_enable();
>> +}
>> +EXPORT_SYMBOL_GPL(put_online_cpus_stable_atomic);
>> +
>> static struct {
>> struct task_struct *active_writer;
>> struct mutex lock; /* Synchronizes accesses to refcount, */
>> @@ -237,6 +304,44 @@ static inline void check_for_tasks(int cpu)
>> write_unlock_irq(&tasklist_lock);
>> }
>>
>> +/*
>> + * We want all atomic hotplug readers to refer to the new value of
>> + * cpu_online_stable_mask. So wait for the existing atomic hotplug readers
>> + * to complete. Any new atomic hotplug readers will see the updated mask
>> + * and hence pose no problems.
>> + */
>> +static void sync_hotplug_readers(void)
>> +{
>> + unsigned int cpu;
>> +
>> + for_each_online_cpu(cpu) {
>> + while (per_cpu(hotplug_reader_refcount, cpu))
>> + cpu_relax();
>> + }
>> +}
>
> That all looks solid to me.
>
>> +/*
>> + * We are serious about taking this CPU down. So clear the CPU from the
>> + * stable online mask.
>> + */
>> +static void prepare_cpu_take_down(unsigned int cpu)
>> +{
>> + set_cpu_online_stable(cpu, false);
>> +
>> + /*
>> + * Prevent reordering of writes to cpu_online_stable_mask and reads
>> + * from hotplug_reader_refcount.
>> + */
>> + smp_mb();
>> +
>> + /*
>> + * Wait for all active hotplug readers to complete, to ensure that
>> + * there are no critical sections still referring to the old stable
>> + * online mask.
>> + */
>> + sync_hotplug_readers();
>> +}
>
> I wonder about the cpu-online case. A typical caller might want to do:
>
>
> /*
> * Set each online CPU's "foo" to "bar"
> */
>
> int global_bar;
>
> void set_cpu_foo(int bar)
> {
> get_online_cpus_stable_atomic();
> global_bar = bar;
> for_each_online_cpu_stable()
> cpu->foo = bar;
> put_online_cpus_stable_atomic()
> }
>
> void_cpu_online_notifier_handler(void)
> {
> cpu->foo = global_bar;
> }
>
> And I think that set_cpu_foo() would be buggy, because a CPU could come
> online before global_bar was altered, and that newly-online CPU would
> pick up the old value of `bar'.
>
> So what's the rule here? global_bar must be written before we run
> get_online_cpus_stable_atomic()?
>
> Anyway, please have a think and spell all this out?

That's right, actually this related to one question, should the hotplug
happen during get_online and put_online?

Answer will be YES according to old API which using mutex, the hotplug
won't happen in critical section, but the cost is get_online() will
block, which will kill the performance.

So we designed this mechanism to do acceleration, but as you pointed
out, although the result will never be wrong, but the 'stable' mask is
not stable since it could be changed in critical section.

And we have two solution.

One is from Srivatsa, using 'read_lock' and 'write_lock', it will
prevent hotplug happen just like the old rule, the cost is we need a
global 'rw_lock' which perform bad on NUMA system, and no doubt,
get_online() will block for short time when doing hotplug.

Another is to maintain a per-cpu cache mask, this mask will only be
updated in get_online(), and be used in critical section, then we will
get a real stable mask, but one flaw is, on different cpu in critical
section, online mask will be different.

We will be appreciate if we could collect some comments on which one to
be used in next version.

Regards,
Michael Wang

>
>> struct take_cpu_down_param {
>> unsigned long mod;
>> void *hcpu;
>> @@ -246,7 +351,9 @@ struct take_cpu_down_param {
>> static int __ref take_cpu_down(void *_param)
>> {
>> struct take_cpu_down_param *param = _param;
>> - int err;
>> + int err, cpu = (long)(param->hcpu);
>> +
>
> Like this please:
>
> int err;
> int cpu = (long)(param->hcpu);
>
>> + prepare_cpu_take_down(cpu);
>>
>> /* Ensure this CPU doesn't handle any more interrupts. */
>> err = __cpu_disable();
>>
>> ...
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/