Re: [git pull for -mm] CPU isolation extensions (updated2)

From: Nick Piggin
Date: Tue Feb 12 2008 - 23:11:59 EST


On Wednesday 13 February 2008 14:32, Max Krasnyansky wrote:
> David Miller wrote:
> > From: Nick Piggin <nickpiggin@xxxxxxxxxxxx>
> > Date: Tue, 12 Feb 2008 17:41:21 +1100
> >
> >> stop machine is used for more than just module loading and unloading.
> >> I don't think you can just disable it.
> >
> > Right, in particular it is used for CPU hotplug.
>
> Ooops. Totally missed that. And a bunch of other places.
>
> [maxk@duo2 cpuisol-2.6.git]$ git grep -l stop_machine_run
> Documentation/cpu-hotplug.txt
> arch/s390/kernel/kprobes.c
> drivers/char/hw_random/intel-rng.c
> include/linux/stop_machine.h
> kernel/cpu.c
> kernel/module.c
> kernel/stop_machine.c
> mm/page_alloc.c
>
> I wonder why I did not see any issues when I disabled stop machine
> completely. I mentioned in the other thread that I commented out the part
> that actually halts the machine and ran it for several hours on my dual
> core laptop and on the quad core server. Tried all kinds of workloads,
> which include constant module removal and insertion, and cpu hotplug as
> well. It cannot be just luck :).

It really is. With subtle races, it can take a lot more than a few
hours. Consider that we have subtle races still in the kernel now,
which are almost never or rarely hit in maybe 10,000 hours * every
single person who has been using the current kernel for the past
year.

For a less theoretical example -- when I was writing the RCU radix
tree code, I tried to run directed stress tests on a 64 CPU Altix
machine (which found no bugs). Then I ran it on a dedicated test
harness that could actually do a lot more than the existing kernel
users are able to, and promptly found a couple more bugs (on a 2
CPU system).

But your primary defence against concurrency bugs _has_ to be
knowing the code and all its interactions.


> Clearly though, you guys are right. It cannot be simply disabled. Based on
> the above grep it's needed for CPU hotplug, mem hotplug, kprobes on s390
> and intel rng driver. Hopefully we can avoid it at least in module
> insertion/removal.

Yes, reducing the number of users by going through their code and
showing that it is safe, is the right way to do this. Also, you
could avoid module insertion/removal?

FWIW, I think the idea of trying to turn Linux into giving hard
realtime guarantees is just insane. If that is what you want, you
would IMO be much better off to spend effort with something like
improving adeos and communicatoin/administration between Linux and
the hard-rt kernel.

But don't let me dissuade you from making these good improvements
to Linux as well :) Just that it isn't really going to be hard-rt
in general.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/