Re: [PATCH 0/10] bulk cpu removal support

From: Ashok Raj
Date: Thu May 11 2006 - 18:09:50 EST


On Thu, May 11, 2006 at 01:42:47PM -0700, Martin Bligh wrote:
> Ashok Raj wrote:
> >
> >
> > It depends on whats running at the time... with some light load, i measured
> > wall clock time, i remember seeing 2 secs at times, but its been a long time
> > i did that.. so take that with a pinch :-)_
> >
> > i will try to get those idle and load times worked out again... the best
> > i have is a 16 way, if i get help from big system oems i will send the
> > numbers out
>
> Why is taking 30s to offline CPUs a problem?
>

Well, the real problem is that for each cpu offline we schedule a RT thread
kstopmachine() on each cpu, then turn off interrupts until this one cpu has
removed. stop_machine_run() is a big enough sledge hammer during cpu offline
and doing this repeatedly... say on a 4 socket system, where each socket=16
logical cpus.

the system would tend to get hick ups 64 times, since we do the stopmachine
thread once for each logical cpu. When we want to replace a node for
reliability reasons, its not clear if this hick ups is a good thing.

Doing kstopmachine() on a single system is in itself noticable, what we heard
from some OEM's is this would have other app level impact as well.

With the bulk removal, we do stop machine just once, but all the 16 cpus
get removed once hence there is just one hickup, instead of 64.

Less time to offline, avoid process and interrupt bouncing on and off a cpu
which is just about to be offlined are almost extra fringe benefits you get
with the bulk removal approach.

--
Cheers,
Ashok Raj
- Open Source Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/