Re: [PATCH] cpu hotplug, sched: Introduce cpu_active_map and redo sched domain management (take 2)

From: Max Krasnyansky
Date: Thu Jul 17 2008 - 14:53:34 EST

Gregory Haskins wrote:
>>>> On Thu, Jul 17, 2008 at 3:16 AM, in message <487EF1E9.2040101@xxxxxxxxxxxx>,
> Max Krasnyansky <maxk@xxxxxxxxxxxx> wrote:
>> Gregory Haskins wrote:
>>> Well, admittedly I am not entirely clear on what problem is being solved as
>>> I was not part of the original thread with Linus. My impression of what you
>>> were trying to solve was to eliminate the need to rebuild the domains for a
>>> hotplug event (which I think is a good problem to solve), thus eliminating
>>> some complexity and (iiuc) races there.
>>> However, based on what you just said, I am not sure I've got that entirely
>>> right anymore. Can you clarify the intent (or point me at the original thread)
>>> so we are on the same page?
>> Here is the link to the original thread
>> And here is where Linus explained the idea
>> I'll reply to the rest of your email tomorrow (can't keep my eyes open any
>> longer :)).
>> Max
> Hi Max,
> Thanks for the pointers. I see that I did indeed misunderstand the intent of the patch.
> It seems you already solved the rebuild problem, and were just trying to solve the
> "migrate to a dead cpu" problem that Linus mentions as a solution with cpu_active_map.
Yes. BTW, they are definitely related, because the reason we were blowing away
the domains was to avoid "migration to a dead cpu", i.e. we were relying on the
fact that domain masks never contain cpus that are either dying or already dead.

> In that case, note that rq->rd->online already fits the bill, I believe. In a nutshell,
> rq->rd->span contains all the cpus within your disjoint cpuset, and rq->rd->online
> contains the subset of rq->rd->span that are online. The online bit is cleared at the
> earliest point in cpu hotplug removal (DYING), and it is set at the very latest point on
> insertion (ONLINE). Therefore it is redundant with the cpu_active_map concept.
> I think the simplest solution is to make sure that we cpus_and against rq->rd->online
> before allowing a migration. This is how I intended the mask to be used, anyway. It's
> what the RT scheduler does. It sounds like we just need to touch up the few places
> on the CFS side that were causing those oopses.
> Thoughts?
None at this point :). I need to run right now and will try to look at this
later today. My knowledge of the internal sched structs is definitely lacking
so I need to look at the rq->rd thing to have an opinion.
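
For illustration, the rq->rd->online check Gregory describes above can be
sketched in userspace C. This is only a minimal sketch under stated
assumptions, not kernel code: the struct, the 64-bit mask type, and all
function names below are hypothetical stand-ins for the kernel's root domain,
cpumask_t, and the cpus_and() step. It models just the two masks from the
discussion (span and online) and shows that ANDing the allowed set against
the online mask keeps a dying cpu from being chosen as a migration target.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's cpumask_t: one bit per cpu (max 64). */
typedef uint64_t cpumask_sketch_t;

/* Simplified model of a root domain holding only the two masks discussed. */
struct root_domain_sketch {
	cpumask_sketch_t span;   /* all cpus in the disjoint cpuset */
	cpumask_sketch_t online; /* subset of span that is currently online */
};

/* Clear the cpu's online bit, as would happen early in removal (DYING). */
static void rd_set_offline(struct root_domain_sketch *rd, int cpu)
{
	rd->online &= ~((cpumask_sketch_t)1 << cpu);
}

/* Set the cpu's online bit, as would happen late in insertion (ONLINE).
 * Cpus outside the span are ignored so that online stays a subset of span. */
static void rd_set_online(struct root_domain_sketch *rd, int cpu)
{
	cpumask_sketch_t bit = (cpumask_sketch_t)1 << cpu;

	if (rd->span & bit)
		rd->online |= bit;
}

/* AND the task's allowed mask against rd->online before migrating.
 * Returns the lowest-numbered eligible cpu, or -1 if none is eligible. */
static int pick_migration_target(const struct root_domain_sketch *rd,
				 cpumask_sketch_t allowed)
{
	cpumask_sketch_t candidates = allowed & rd->online;
	int cpu;

	for (cpu = 0; cpu < 64; cpu++)
		if (candidates & ((cpumask_sketch_t)1 << cpu))
			return cpu;
	return -1;
}
```

In this sketch, once rd_set_offline() has run for a cpu, a task whose allowed
mask contains only that cpu gets -1 back instead of a dead target, while a
task that is also allowed on another online cpu is steered there.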


