Re: [PATCH v6 2/4] sched/fair: Check a task has a fitting cpu when updating misfit

From: Qais Yousef
Date: Thu Mar 21 2024 - 08:20:59 EST


On 03/07/24 18:54, Dietmar Eggemann wrote:
> On 07/03/2024 11:35, Qais Yousef wrote:
> > On 03/07/24 10:14, Vincent Guittot wrote:
> >> On Wed, 6 Mar 2024 at 22:47, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> >>>
> >>> On 03/03/24 17:44, Qais Yousef wrote:
> >>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index 174687252e1a..b0e60a565945 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -8260,6 +8260,8 @@ static void set_task_max_allowed_capacity(struct task_struct *p)
> >>>>  		cpumask_t *cpumask;
> >>>>
> >>>>  		cpumask = cpu_capacity_span(entry);
> >>>> +		if (!cpumask_intersects(cpu_active_mask, cpumask))
> >>>> +			continue;
> >>>>  		if (!cpumask_intersects(p->cpus_ptr, cpumask))
> >>>>  			continue;
> >>>>
> >>>> @@ -8269,6 +8271,53 @@ static void set_task_max_allowed_capacity(struct task_struct *p)
> >>>>  	rcu_read_unlock();
> >>>>  }
> >>>>
> >>>> +static void __update_tasks_max_allowed_capacity(unsigned long capacity)
> >>>> +{
> >>>> +	struct task_struct *g, *p;
> >>>> +
> >>>> +	for_each_process_thread(g, p) {
> >>>> +		if (fair_policy(p->policy) && p->max_allowed_capacity == capacity)
> >>> This condition is actually not good enough. We need to differentiate
> >>> between going online/offline. I didn't want to call
> >>> set_task_max_allowed_capacity() unconditionally and make hotplug even
> >>> slower.
> >>
> >> But should we even try to fix this? Hotplugging a cpu is a special
> >> case, and with patch 4 you will not increase lb_interval anymore
> >
> > I don't care to be honest and this was my first reaction, but I couldn't ignore
> > the report.
>
> Seeing a 'max_allowed_capacity' on the task which is not achievable
> anymore due to CPU hp will still cause MF activity. So it's a special
> case but CPU hp is part of mainline ... ?

We shouldn't leave anything broken for sure. What's the failure mode you're
worrying about? The side effect is that we'll do unnecessary misfit load
balance. The biggest impact I see is that if we have a true imbalance, then
because the misfit lb type takes precedence over the other ones, we can end up
delaying the other types of lb. But we won't drive balance_interval etc. high
anymore.
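
To illustrate, the check this series adds to update_misfit_status() is
roughly the below (paraphrased from the series, so a sketch rather than the
final code). If max_allowed_capacity still points at a capacity level that
has gone fully offline, the equality can never hold, so the task keeps being
flagged as misfit and misfit lb keeps being attempted:

	static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
	{
		if (!sched_asym_cpucap_active())
			return;

		/*
		 * Affinity allows us to go somewhere higher? Or are we on the
		 * biggest available CPU already? Or do we fit into this CPU?
		 */
		if (!p || (p->nr_cpus_allowed == 1) ||
		    (arch_scale_cpu_capacity(cpu_of(rq)) == p->max_allowed_capacity) ||
		    task_fits_cpu(p, cpu_of(rq))) {
			rq->misfit_task_load = 0;
			return;
		}

		rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
	}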

>
> > I will need to do something to handle the dynamic EM changing capacities
> > anyway after the 6.9 merge window. Or maybe now; I still haven't thought
> > about it. I am
>
> Are you thinking about the case in which the reloadable EM contains a
> 'table[highest OPP].performance' value which is different from
> arch_scale_cpu_capacity()?

Yeah, this is supported, right?

>
> Can we reject those EM reloads to avoid this mismatch?

This would defeat the purpose of updating the EM though?
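
In pseudo-C, the mismatch you describe would be something like the below
('table' being the reloaded perf states of a CPU's perf domain; the names
are illustrative, so treat it as a sketch):

	/*
	 * After an EM reload, the highest perf state may advertise a
	 * performance value that no longer matches arch_scale_cpu_capacity(),
	 * leaving tasks' max_allowed_capacity stale until something calls
	 * set_task_max_allowed_capacity() again.
	 */
	if (table[nr_states - 1].performance != arch_scale_cpu_capacity(cpu))
		/* p->max_allowed_capacity may now be unreachable */;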

>
> > hoping I can trigger the update somewhere from the topology code. Maybe that
> > work will make handling hotplug easier than the approach I've taken now on
> > rq_online/offline.
> >
> > FWIW, I have a working patch that solves the problem. The only drawback is
> > that rq_online/offline() are not only called from the
> > sched_cpu_activate/deactivate() path but also from the
> > build_sched_domains() path, which for some reason ends up calling
> > rq_offline/online() for each cpu in the map. To be even more efficient I need
>
> This can be avoided IMHO when you do this only for 'cpu_of(rq) ==
> smp_processor_id()'.

It feels hacky, but it's probably the most straightforward way to do it. I was
thinking we could refactor to make it explicit whether rq_offline() is being
done due to hotplug or not.
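
Something like the below is what I have in mind for your suggestion (a
minimal sketch; update_tasks_max_allowed_capacity() stands in for the helper
in the working patch at the end, so the name is illustrative):

	static void rq_online_fair(struct rq *rq)
	{
		update_sysctl();

		update_runtime_enabled(rq);

		/*
		 * build_sched_domains() calls rq_online() for every cpu from
		 * the cpu doing the rebuild, while the hotplug path runs on
		 * the cpu being brought online; this filters out the
		 * domain-rebuild calls.
		 */
		if (cpu_of(rq) == smp_processor_id())
			update_tasks_max_allowed_capacity(arch_scale_cpu_capacity(cpu_of(rq)));
	}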

>
> For off-lining there will be only one such call to rq_offline_fair()
> (from sched_cpu_deactivate()) but for on-lining there are still 2 such
> calls to rq_online_fair() (from sched_cpu_activate() and rq_attach_root()).
>
> > to teach rq_offline/online() to differentiate between the two. Or refactor
> > the code. If you don't think it's important either, I'd be happy to drop
> > this and replace it with a comment to see if someone cares. Only testing and
> > dev could end up using hotplug; so there could be a difference in behavior in
> > regards to how often misfit migration can kick in. But as you said, hopefully
> > with these fixes it'd just end up being unnecessary work. The only potential
> > problem maybe is that misfit lb has precedence over the other lb types; so we
> > could end up delaying fixing a real imbalance if there's a pointless misfit lb?
> >
> > I'm happy to follow the crowd. But it'd be nice if this series can be made
> > mergeable with follow up work. It'd make life much easier for me.
>
> Or is the plan to only go with '[PATCH v6 4/4] sched/fair: Don't double
> balance_interval for migrate_misfit' ?

Yeah, I think this is enough, as hotplug operations that make a whole capacity
level disappear are not a normal thing even during testing and development.
Systems with a single big core could easily lead to this situation, but if
someone is testing with this, then the perf impact of losing the whole
capacity level is more impactful than the minor sub-optimality we have here.
Load balance should not be delayed, but it could kick off unnecessarily.

I think we can skip the handling unless someone comes up with a stronger
reason to care, but for the record here's a working patch. Unless I hear
strong opinions that it is worth it, I'll send v8 to address your other
comments.


Thanks!

--
Qais Yousef


--->8---