Re: BUG_ON(!cpus_equal(cpumask, tmp));

From: Martin J. Bligh
Date: Tue Mar 30 2004 - 20:54:13 EST


>> I think we're assuming that we don't have to because the problem is fixed
>> by the "cpus_and(tmp, cpumask, cpu_online_map)" in flush_tlb_others so we
>> don't have to. Except it's racy, and doesn't work.
>
> And it's a kludge, to work around dangling references to a CPU which has
> gone away.

Yes ;-)

>> It would seem to me that your suggestion would fix it. But isn't locking
>> cpu_online_map both simpler and (most importantly) more generic? I can't
>> imagine that we don't use this elsewhere ... suppose for instance we took
>> a timer interrupt, causing a scheduler rebalance, and moved a process to
>> an offline CPU at that point? Isn't any user of smp_call_function also racy?
>
> If we have to add any fastpath locking to cope with CPU removal or reboot
> then it's time to make CONFIG_HOTPLUG_CPU dependent upon CONFIG_BROKEN.

Yeah, but as we've proved, it's not just hotplug, it's shutdown. And I don't
think we can make that depend on CONFIG_BROKEN ;-) I don't see a *read*
side RCU lock as an impostion on the fastpath (for reading cpu_online_map),
and I don't care if writing to cpu_online_map is slower. A spinlock would
be crappy, yes ...

> yes, cpu_online_map should be viewed as a reference to the going-away CPU
> for smp_call_function purposes. However the CPU takedown code appears to
> do the right thing: it removes the cpu from cpu_online_map first, then does
> the stop_machine() thing which should ensure that all other CPUs have
> completed any cross-CPU call which they were doing, yes?

Andy almost managed to convince me that the smp_call_function stuff is safe,
based on call_lock exclusion. Except that we count that cpu stuff outside
it ... but that's trivial to fix, we just move it inside the lock (patch
below - untested, but trivial).

He also pointed out that we could fairly easily fix the tlb stuff by
taking the tlb lock before taking a cpu offline. Which still doesn't
make me desperately comfortable ... but then he's smarter than me ;-)
To me it comes down to ... do we want to lock the damned thing, or fix
all the callers to be really, really careful?

diff -purN -X /home/mbligh/.diff.exclude virgin/arch/i386/kernel/smp.c smp_call_function/arch/i386/kernel/smp.c
--- virgin/arch/i386/kernel/smp.c 2004-03-11 14:33:36.000000000 -0800
+++ smp_call_function/arch/i386/kernel/smp.c 2004-03-30 17:43:34.000000000 -0800
@@ -514,10 +514,7 @@ int smp_call_function (void (*func) (voi
*/
{
struct call_data_struct data;
- int cpus = num_online_cpus()-1;
-
- if (!cpus)
- return 0;
+ int cpus;

data.func = func;
data.info = info;
@@ -527,6 +524,10 @@ int smp_call_function (void (*func) (voi
atomic_set(&data.finished, 0);

spin_lock(&call_lock);
+ cpus = num_online_cpus()-1;
+ if (!cpus)
+ return 0;
+
call_data = &data;
mb();


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/