Re: [PATCH 0/10] bulk cpu removal support

From: Nathan Lynch
Date: Thu May 11 2006 - 20:09:11 EST


Ashok Raj wrote:
> On Thu, May 11, 2006 at 01:42:47PM -0700, Martin Bligh wrote:
> > Ashok Raj wrote:
> > >
> > >
> > > It depends on whats running at the time... with some light load, i measured
> > > wall clock time, i remember seeing 2 secs at times, but its been a long time
> > > i did that.. so take that with a pinch :-)_
> > >
> > > i will try to get those idle and load times worked out again... the best
> > > i have is a 16 way, if i get help from big system oems i will send the
> > > numbers out
> >
> > Why is taking 30s to offline CPUs a problem?
> >
>
> Well, the real problem is that for each cpu offline we schedule a RT thread
> kstopmachine() on each cpu, then turn off interrupts until this one cpu has
> removed. stop_machine_run() is a big enough sledge hammer during cpu offline
> and doing this repeatedly... say on a 4 socket system, where each socket=16
> logical cpus.
>
> the system would tend to get hick ups 64 times, since we do the stopmachine
> thread once for each logical cpu. When we want to replace a node for
> reliability reasons, its not clear if this hick ups is a good thing.

Can you provide more detail on what you mean by hiccups?


> Doing kstopmachine() on a single system is in itself noticable, what we heard
> from some OEM's is this would have other app level impact as well.

What "other app level impact"?


> With the bulk removal, we do stop machine just once, but all the 16 cpus
> get removed once hence there is just one hickup, instead of 64.

Have you done any profiling or other instrumentation that identifies
stopmachine as the real culprit here? I mean, it's a reasonable
assumption to make, but are you sure there's not something else
causing the hiccups? Perhaps contention on the cpu hotplug lock, or
something wrong in the architecture cpu_disable code?

Module unload also uses stop_machine_run, iirc. Do you see hiccups
with that, too?


> Less time to offline, avoid process and interrupt bouncing on and off a cpu
> which is just about to be offlined are almost extra fringe benefits you get
> with the bulk removal approach.

Ok, so that's not the primary motivation for these patches?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/