Re: [PATCH 1/4] sched/fair: make sure to try to detach at least one movable task
From: Vincent Guittot
Date: Fri Mar 22 2024 - 13:16:56 EST
On Thu, 21 Mar 2024 at 21:25, Josh Don <joshdon@xxxxxxxxxx> wrote:
>
> On Wed, Mar 20, 2024 at 9:58 AM Vincent Guittot
> <vincent.guittot@xxxxxxxxxx> wrote:
> >
> > Hi Josh,
> >
> > Sorry for the late reply.
>
> No worries, responses to your comments inline below.
>
> > > We had a user recently trigger a hard lockup, which we believe is due
> > > to this patch. The user in question had O(10k) threads affinitized to
> > > a cpu; it seems the process had an out-of-control thread-spawning
> > > issue and was in the middle of being killed. However, the kill was
> > > being slowed down by the fact that load balance was iterating over all
> >
> > Does that mean it was progressing, but just not as fast as you would like?
>
> It was a hard lockup, so it's more than just "not fast enough".
> Indeed it was progressing, but not at a rate sufficient to avoid
> lockup.
>
> > > these threads and bouncing the rq lock (and making no progress due to
> > > ALL_PINNED). Before this patch, load balance would quit after hitting
> > > loop_max.
> > >
> > > Even ignoring that specific instance, it seems pretty easy for this
> > > patch to cause a softlockup due to a buggy or malicious process.
> >
> > The fact that the rq lock is released regularly should prevent a
> > softlockup.
>
> That doesn't prevent a softlockup; the kernel is still stuck looping
> over a long list of tasks for too long, regardless of whether it is
> releasing and re-acquiring the rq lock.
>
> Note also that load balance can run from softirq context with IRQs
> disabled, which can lead to a hard lockup as well.
fair enough
>
> > And we could even speed up can_migrate() which does a lot of
> > useless work for tasks affined to 1 cpu.
>
> That seems like a useful optimization, but not really relevant? It
> doesn't matter how small we make the constant factor, we still have an
> O(n) operation in kernel mode here.
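For reference, the kind of shortcut I was thinking of is an early exit
before the more expensive checks; untested sketch against
kernel/sched/fair.c:

static int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
	lockdep_assert_rq_held(env->src_rq);

	/*
	 * Untested sketch: a task that is allowed on a single CPU can
	 * never be migrated, so bail out before the throttling,
	 * affinity-mask and cache-hotness checks below.
	 */
	if (p->nr_cpus_allowed == 1)
		return 0;

	/* ... existing checks unchanged ... */
}

You're right that it doesn't change the O(n) walk; it only shrinks the
constant factor.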
>
> > > For the tradeoff you were trying to make in this patch (spend more
> > > time searching in the hopes that there's something migratable further
> > > in the list), perhaps it would be better to adjust
> > > sysctl.sched_nr_migrate instead of baking this into the kernel?
> >
> > That could be a solution, but it increases the number of iterations
> > for all cases, including those which are more time consuming to sort
> > out, and it also increases the number of tasks that can be migrated
> > in one load balance. The latter is what consumes time.
>
> Is it really that bad? loop_max will be unchanged for most cases since
> it gets min'd with nr_running anyway. And, even if loop_max ends up
> larger in some other instances, we still terminate the iteration after
> fixing up env->imbalance (granted, we'll migrate more tasks to achieve
> a better balance with a larger loop_max, which I think is your point).
Yes, my point is that the load of a task can be quite small, especially
with cgroups, so we can end up detaching/attaching a very large number
of tasks, which is far more time consuming than just checking whether
they can be migrated or not.
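To illustrate, a simplified view of the detach_tasks() loop (not the
exact code, and loop_max is already min(sysctl_sched_nr_migrate,
busiest->nr_running) as you say):

	while (!list_empty(tasks)) {
		env->loop++;
		/* A larger loop_max directly raises this bound */
		if (env->loop > env->loop_max)
			break;

		p = list_last_entry(tasks, struct task_struct, se.group_node);
		if (!can_migrate_task(p, env))
			goto next;	/* cheap: only a check */

		load = task_h_load(p);
		if (load > env->imbalance)
			goto next;

		detach_task(p, env);	/* expensive: dequeue + migrate */
		list_add(&p->se.group_node, &env->tasks);

		env->imbalance -= load;
		if (env->imbalance <= 0)
			break;

		continue;
next:
		list_move(&p->se.group_node, tasks);
	}

With cgroups, task_h_load() can be very small, so env->imbalance
decreases slowly and almost every scanned task ends up being detached
here and attached on the destination CPU.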
>
>
> Another idea then: what about separating the number of tasks we can
> move from the number of tasks we can search? You effectively want to
> keep the number of tasks that can be migrated small (nr_migrate), but
> be able to search deeper in the list for things to pull (a larger
> search_depth).
That could be a solution indeed
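Roughly, detach_tasks() would then have two independent budgets;
hypothetical sketch, where "search_depth" is a made-up knob and not an
existing sysctl:

	while (!list_empty(tasks)) {
		env->loop++;
		/* Bound how many tasks we *scan*... */
		if (env->loop > env->search_depth)
			break;

		p = list_last_entry(tasks, struct task_struct, se.group_node);
		if (!can_migrate_task(p, env))
			goto next;

		detach_task(p, env);
		list_add(&p->se.group_node, &env->tasks);

		/* ...separately from how many tasks we *move*. */
		if (++detached >= env->loop_max)
			break;

		continue;
next:
		list_move(&p->se.group_node, tasks);
	}

That would keep the time spent skipping pinned tasks bounded while
still allowing the balancer to look past the first loop_max entries.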
>
> - Josh