Re: [PATCH 2/3] generic-ipi: remove kmalloc()

From: Peter Zijlstra
Date: Wed Feb 18 2009 - 05:43:03 EST


On Tue, 2009-02-17 at 16:28 -0800, Paul E. McKenney wrote:
> On Tue, Feb 17, 2009 at 10:59:06PM +0100, Peter Zijlstra wrote:

> > +static int
> > +hotplug_cfd(struct notifier_block *nfb, unsigned long action, void *hcpu)
> > +{
> > + long cpu = (long)hcpu;
> > + struct call_function_data *cfd = &per_cpu(cfd_data, cpu);
> > +
> > + switch (action) {
> > + case CPU_UP_PREPARE:
> > + case CPU_UP_PREPARE_FROZEN:
> > + if (!alloc_cpumask_var_node(&cfd->cpumask, GFP_KERNEL,
> > + cpu_to_node(cpu)))
> > + return NOTIFY_BAD;
> > + break;
> > +
> > +#ifdef CONFIG_HOTPLUG_CPU
> > + case CPU_UP_CANCELED:
> > + case CPU_UP_CANCELED_FROZEN:
> > +
> > + case CPU_DEAD:
> > + case CPU_DEAD_FROZEN:
> > + free_cpumask_var(cfd->cpumask);
> > + break;
> > +#endif
> > + };
> > +
> > + return NOTIFY_OK;
> > +}
>
> Hmmm.... Why not the following? Do we really need to free the cpumask
> when a CPU departs, given that it might return later?

Probably not, but that's what we have these callbacks for, might as well
use them.
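
For reference, the callbacks in question are the regular CPU notifier
chain; the patch just hangs hotplug_cfd() off it, roughly like so
(sketch only, see the patch for the actual registration site):

static struct notifier_block hotplug_cfd_notifier = {
        .notifier_call = hotplug_cfd,
};

/* called once at init time */
register_cpu_notifier(&hotplug_cfd_notifier);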

> > +/*
> > + * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
> > + *
> > + * For non-synchronous ipi calls the csd can still be in use by the previous
> > + * function call. For multi-cpu calls it's even more interesting as we'll have
> > + * to ensure no other cpu is observing our csd.
> > + */
> > +static void csd_lock(struct call_single_data *data)
> > {
> > - /* Wait for response */
> > - do {
> > - if (!(data->flags & CSD_FLAG_WAIT))
> > - break;
> > + while (data->flags & CSD_FLAG_LOCK)
> > cpu_relax();
> > - } while (1);
> > + data->flags = CSD_FLAG_LOCK;
>
> We do need an smp_mb() here, otherwise, the call from
> smp_call_function_single() could be reordered by either CPU or compiler
> as follows:
>
> data->func = func;
> data->info = info;
> csd_lock(data);
>
> This might come as a bit of a surprise to the other CPU still trying to
> use the old values for data->func and data->info.
>
> Note that this smp_mb() is required even if cpu_relax() contains a
> memory barrier, as it is possible to execute csd_lock_wait() without
> executing the cpu_relax(), if you get there at just the right time.

I'm not quite sure I follow; if we observe !(flags & LOCK) then we're
guaranteed that no cpu still needs func and info. They might still
be observing the list entry, but they should not find themselves in the
cpumask -- which is guarded by ->lock.

Anyway, I'm happy to add the mb().
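
Something like this, then -- an untested sketch of the same csd_lock()
with the barrier added:

static void csd_lock(struct call_single_data *data)
{
        while (data->flags & CSD_FLAG_LOCK)
                cpu_relax();
        data->flags = CSD_FLAG_LOCK;

        /*
         * Prevent the CPU from reordering the above assignment of
         * ->flags against the subsequent assignments to ->func and
         * ->info in the callers.
         */
        smp_mb();
}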

> > @@ -122,41 +198,35 @@ void generic_smp_call_function_interrupt
> > * It's ok to use list_for_each_rcu() here even though we may delete
> > * 'pos', since list_del_rcu() doesn't clear ->next
> > */
> > - rcu_read_lock();
> > - list_for_each_entry_rcu(data, &call_function_queue, csd.list) {
> > + list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
>
> OK... What prevents the following sequence of events?
>
> o CPU 0 calls smp_call_function_many(), targeting numerous CPUs,
> including CPU 2. CPU 0 therefore enqueues this on the global
> call_function.queue. "wait" is not specified, so CPU 0 returns
> immediately after sending the IPIs.
>
> o CPU 1 disables irqs and leaves them disabled for a while.
>
> o CPU 2 receives the IPI, and duly calls the needed function.
> It decrements the ->refs field, but, finding the result
> non-zero, refrains from removing the element that CPU 0
> enqueued, and also refrains from invoking csd_unlock().
>
> o CPU 3 also receives the IPI, and also calls the needed function.
> Now, only CPU 1 need receive the IPI for the element to be
> removed.
>
> o CPU 3 calls smp_call_function_many(), targeting numerous CPUs,
> but -not- including CPU 2. CPU 3 therefore also enqueues this on the
> global call_function.queue and sends the IPIs, but no IPI for
> CPU 2. Your choice as to whether CPU 3 waits or not.

Right, so the queue is Entry3->Entry0 (we place new entries at the
head).

> o CPU 2 receives CPU 3's IPI, but CPU 0's element is first on the
> list. CPU 2 fetches the pointer (via list_for_each_entry_rcu()),
> and then...

CPU0? You just excluded cpu2, cpu1 is still blocked, and cpu3 sent the
ipi.

Furthermore, Entry3 would be first, but let's suppose it is Entry0, since
that makes the scenario work best.

> o CPU 1 finally re-enables irqs and receives the IPIs!!! It
> finds CPU 0's element on the queue, calls the function,
> decrements the ->refs field, and finds that it is zero.
> So, CPU 1 invokes list_del_rcu() to remove the element
> (OK so far, as list_del_rcu() doesn't overwrite the next
> pointer), then invokes csd_unlock() to release the element.

CPU1 will see CPU3's element first, and CPU0's element second. But OK.

> o CPU 0 then invokes another smp_call_function_many(), also
> multiple CPUs, but -not- to CPU 2. It requeues the element
> that was just csd_unlock()ed above, carrying CPU 2 with it.
> It IPIs CPUs 1 and 3, but not CPU 2.
>
> o CPU 2 continues, and falls off the bottom of the list. It will
> continue to ignore CPU 0's IPI until some other CPU IPIs it.
> On some architectures, a single-target IPI won't cut it, only
> a multi-target IPI.
>
> Or am I missing something subtle here?

Ah, right, yes. I place new entries at the HEAD not TAIL, so that in
this case we go from:

Entry3->Entry0
^
CPU2

to

Entry0->Entry3
^
CPU2

and CPU2 will continue the list iteration, visiting Entry3 twice, which
is harmless since it will have removed itself from the cpumask the first
time around.
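
Concretely, the enqueue side always adds at the head, under
call_function.lock -- roughly this, paraphrasing the patch rather than
quoting it:

        spin_lock_irqsave(&call_function.lock, flags);
        list_add_rcu(&data->csd.list, &call_function.queue);
        spin_unlock_irqrestore(&call_function.lock, flags);

so a csd that gets unlocked and re-used re-appears in front of whatever
a concurrent reader is currently looking at; at worst the reader
re-walks entries it has already handled (harmless, it's no longer in
their cpumask) instead of dropping off the list the way your scenario
has it.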

> If this really is a problem, there are a number of counter-based solutions
> to it. (Famous last words...)
>
> Abandoning all caution and attempting one on the fly... Make each CPU
> receiving an IPI increment one per-CPU counter upon entry, and increment
> it again upon exit with memory barriers after and before, respectively.
> Then any CPU with an even value can be ignored, and any CPU whose value
> changes can also be ignored. Of course, this means you have to scan all
> CPUs... But in the worst case, you also had to IPI them all.
>
> Given that this operation is relatively rare, it might be worth using
> shared reference counters, possibly one pair of such counters per (say)
> 16 CPUs. Then the caller flips the counter.

Yep, I almost implemented a counting RCU which increments a global
counter on IPI entry and decrements on IPI exit, but then did the above
FIFO->FILO queue thingy.
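
For the curious, that counting variant would have looked roughly like
this (never written; ipi_nesting and csd_quiesce are made-up names):

static atomic_t ipi_nesting = ATOMIC_INIT(0);

void generic_smp_call_function_interrupt(void)
{
        atomic_inc(&ipi_nesting);
        smp_mb__after_atomic_inc();

        /* ... walk call_function.queue as before ... */

        smp_mb__before_atomic_dec();
        atomic_dec(&ipi_nesting);
}

/* before re-using a csd, wait for all in-flight interrupts to drain */
static void csd_quiesce(void)
{
        while (atomic_read(&ipi_nesting))
                cpu_relax();
}

but the head-insertion above avoids having to spin on a global counter
at all.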

> Alternatively, you can explain to me why my scenario above cannot
> happen -- but at present, it will take some serious explaining!!!

I hope to have adequately explained it, but please, feel free to poke
more holes in it ;-)



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/