Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with sched_domain_shared
From: Peter Zijlstra
Date: Wed May 11 2016 - 08:34:04 EST
On Wed, May 11, 2016 at 12:55:56PM +0100, Matt Fleming wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7842,13 +7842,13 @@ static inline void set_cpu_sd_state_busy
> >  	int cpu = smp_processor_id();
> >
> >  	rcu_read_lock();
> > -	sd = rcu_dereference(per_cpu(sd_busy, cpu));
> > +	sd = rcu_dereference(per_cpu(sd_llc, cpu));
> >
> >  	if (!sd || !sd->nohz_idle)
> >  		goto unlock;
> >  	sd->nohz_idle = 0;
> >
> > -	atomic_inc(&sd->groups->sgc->nr_busy_cpus);
> > +	atomic_inc(&sd->shared->nr_busy_cpus);
> >  unlock:
> >  	rcu_read_unlock();
> >  }
>
> This breaks my POWER7 box which presumably doesn't have SD_SHARE_PKG_RESOURCES,
>
Hmm, PPC folks; what does your topology look like?

Currently your sched_domain_topology, as per arch/powerpc/kernel/smp.c,
seems to suggest your cores do not share cache at all.

https://en.wikipedia.org/wiki/POWER7 seems to agree and states

  "4 MB L3 cache per C1 core"

And http://www-03.ibm.com/systems/resources/systems_power_software_i_perfmgmt_underthehood.pdf
also explicitly draws pictures with the L3 per core.

_However_, that same document describes L3 inter-core fill and lateral
cast-out, which sounds like the L3s work together to form a node-wide
caching system.

Do we want to model this co-operative L3 slices thing as a sort of
node-wide LLC for the purpose of the scheduler?
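
For illustration only, here is a rough user-space sketch of what a
"node-wide LLC" would mean for domain selection. It assumes the LLC is
picked as the highest level, walking bottom-up, in an unbroken run of
levels that carry a cache-sharing flag. The struct, the flag value and
the two tables are hypothetical stand-ins, not the kernel's definitions
and not the actual powerpc topology.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cache-sharing flag such as SD_SHARE_PKG_RESOURCES. */
#define MODEL_SHARE_CACHE 0x1u

/* Hypothetical model of one topology level; tables are ordered smallest span first. */
struct level_model {
	const char *name;
	unsigned int flags;
};

/*
 * Walk the levels bottom-up and return the index of the highest level in
 * the unbroken run that has MODEL_SHARE_CACHE set, or -1 if the lowest
 * level already lacks it (i.e. no LLC level is found at all).
 */
static int find_llc_level(const struct level_model *levels, size_t n)
{
	int llc = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!(levels[i].flags & MODEL_SHARE_CACHE))
			break;
		llc = (int)i;
	}
	return llc;
}

/* Assumed case: strictly per-core L3 with nothing flagged as shared. */
static const struct level_model per_core_l3[] = {
	{ "SMT",  0 },
	{ "NODE", 0 },
};

/* Assumed case: co-operative L3 slices modelled as one node-wide LLC. */
static const struct level_model node_wide_llc[] = {
	{ "SMT",  MODEL_SHARE_CACHE },
	{ "NODE", MODEL_SHARE_CACHE },
};
```

With the first table find_llc_level() finds no LLC level at all; with
the second it lands on the NODE level.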

While we should definitely fix the assumption that an LLC exists (and I
need to look at why the LLC domain isn't set to the core domain instead
as well), the scheduler does try and scale things by 'assuming'
LLC := node.

It does this for NOHZ, and these here patches under discussion would be
doing the same for idle-core state.
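
A minimal sketch of the first point, i.e. not assuming an LLC domain (or
its shared state) exists before touching the busy counter. This is a
user-space model using C11 atomics; the struct and function names are
hypothetical and only mirror the shape of the quoted hunk, they are not
the kernel's definitions.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of the shared per-LLC state from the patch. */
struct shared_model {
	atomic_int nr_busy_cpus;
};

/* Hypothetical model of a sched domain; shared may be NULL. */
struct domain_model {
	int nohz_idle;
	struct shared_model *shared;
};

/*
 * Mark the CPU busy. Returns 1 if the counter was bumped, 0 if there was
 * nothing to do (no LLC domain, no shared state, or not marked idle).
 */
static int set_cpu_busy_model(struct domain_model *llc)
{
	if (!llc || !llc->shared || !llc->nohz_idle)
		return 0;

	llc->nohz_idle = 0;
	atomic_fetch_add(&llc->shared->nr_busy_cpus, 1);
	return 1;
}
```

Passing NULL here models a CPU with no LLC domain; the call is then a
no-op instead of a NULL dereference.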

Would this make sense for POWER, or should we think of something else?