Re: [PATCH]cpuset: add new API to change cpuset top group's cpus

From: Shaohua Li
Date: Wed May 20 2009 - 21:22:30 EST


On Thu, May 21, 2009 at 01:36:35AM +0800, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <peterz@xxxxxxxxxxxxx> [2009-05-20 15:41:55]:
>
> > On Wed, 2009-05-20 at 15:13 +0200, Andi Kleen wrote:
> > > Thanks for the explanation.
> > >
> > > My naive reaction would be to fail if the socket to be taken out
> > > is the only member of some cpuset. Or maybe break affinities in this case.
> >
> > Right, breaking affinities would go against the policy of the admin, I'm
> > not sure we'd want to go there. We could start generating msgs about how
> > we're in thermal trouble and the given configuration is obstructing
> > counter measures etc..
> >
> > Currently hot-unplug does break affinities, but that's an explicit
> > action by the admin himself, so he gets what he asks for (and we do
> > generate complaints in syslog about it).
> >
> > [ Same scenario for the HPC guys who affinity fix all their threads to
> > specific cpus, there's really nothing you can do there. Then again
> > such folks generally run their machines at 100% so they'd better
> > be able to deal with their thermal peak capacity anyway. ]
> >
> > > > You really want to start shrinking the generic computational capacity
> > > > first.
> > >
> > > One general issue to remember is that if you don't react to the platform hint
> > > the platform will likely force a lower p-state on you to not exceed
> > > the thermal limits, making everyone slower.
> > >
> > > (this will likely also not make your real time process happy)
> >
> > Quite.
> >
> > > So it's a bit more than a hint; it's more like a command "or else"
> > >
> > > So it's a good idea to react, or at least make a reasonable attempt
> > > to react.
> >
> > Sure, does the thing give more than a: 'react now, or else' impulse?
> > That is, can we see it coming, or will we have to deal with it when
> > we're there?
> >
> > The latter also has the problem that you have to react very quickly.
> >
> > > > The thing is, you cannot simply rip cpus out from under a system, people
> > > > might rely on them being there and have policy attached to them -- esp.
> > > > people touching cpusets should know that a machine isn't configured
> > > > homogeneous and any odd cpu will do.
> > >
> > > Ok, so do you think it's possible to figure out based on the cpuset
> > > graph / real time runqueue if a socket can be taken out?
> >
> > Right, so all of this depends on a number of things, how frequent and
> > how fast would these situations occur?
> >
> > I would think they'd be rare events, otherwise you really messed up your
> > infrastructure. I also think reaction times should be in the seconds,
> > otherwise you're cutting it way too close.
> >
> >
> > The work IBM has been doing is centered around overloading neighbouring
> > packages in order to keep some idle. The overload is exposed as a
> > percentage.
> >
> > This works within scheduling domains, so if you carve your machine up in
> > tiny (<= 1 package) domains it's impossible to do anything (corner case,
> > we could send cries for help syslog's way).
> >
> > I was hoping we could control the situation with that. But for that to
> > work we need some gradual information in order to make that
> > thermal<->overload feedback work.
>
> The advantage of this method is that it reduces load on one package
> without targeting a particular CPU. This is less restrictive and lets
> the load balancer work out the details. Keeping a core idle on
> average (over a time interval) is good enough to reduce the power and
> heat.
>
> Here we need not touch the RT jobs or break user space policies. We
> effectively reduce capacity and let the loadbalancer have the
> flexibility of figuring out which CPU should not be scheduled now.
>
> That said, this is not useful for a 'cpu cache error' case, in which
> case you will have to cpu-hot-unplug anyway. You don't want any
> interrupts/timers to land there in an unreliable CPU.
>
> Overloading the powersave load balancer to assume reduced capacity on
> some of the packages while overloading some other packages is the
> core idea. The RFC patches still need a lot of work to meet the
> required functionality.

So the main concern is breaking user policy, but it appears that any approach
(cpu hotplug or cpuset) will break user policy (affinity). I wonder how the
scheduler approach can overcome this; my scheduler knowledge is limited.
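
To make the affinity concern concrete, here is a minimal userspace sketch in
Python. It is only an illustration of the check Andi suggested (fail if the
socket to be taken out is the only member of some cpuset); it is not part of
the patch and not kernel code. The function name, the dictionary layout and
the CPU numbering are all assumptions made up for the example.

```python
def cpusets_emptied_by_removal(cpusets, socket_cpus):
    """Return the names of cpusets that would have no CPU left
    if every CPU in socket_cpus were taken away.

    cpusets:     dict mapping a (hypothetical) cpuset name to a set of CPU ids
    socket_cpus: iterable of CPU ids belonging to the socket to be idled
    """
    socket_cpus = set(socket_cpus)
    return sorted(
        name
        for name, cpus in cpusets.items()
        # A cpuset is emptied when all of its CPUs sit on the removed socket.
        if cpus and set(cpus) <= socket_cpus
    )


# Hypothetical machine: socket 0 = CPUs 0-3, socket 1 = CPUs 4-7.
cpusets = {
    "rt": {4, 5},                        # confined to socket 1
    "hpc": {0, 1},                       # confined to socket 0
    "batch": {0, 1, 2, 3, 4, 5, 6, 7},   # spans both sockets
}
socket1 = {4, 5, 6, 7}

blocked = cpusets_emptied_by_removal(cpusets, socket1)
if blocked:
    print("cannot idle socket 1 without breaking:", ", ".join(blocked))
else:
    print("socket 1 can be idled without emptying any cpuset")
```

In this layout the "rt" cpuset has nowhere else to run, so whichever
mechanism removes socket 1 has to either refuse or break that policy. The
same test applies to a per-task affinity mask in place of a cpuset.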

Thanks,
Shaohua