Re: [PATCH 1/4] sched/fair: make sure to try to detach at least one movable task
From: Josh Don
Date: Thu Mar 21 2024 - 16:25:59 EST
On Wed, Mar 20, 2024 at 9:58 AM Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:
>
> Hi Josh,
>
> Sorry for the late reply.
No worries, responses to your comments inline below.
> > We had a user recently trigger a hard lockup which we believe is due
> > to this patch. The user in question had O(10k) threads affinitized to
> > a cpu; seems like the process had an out of control thread spawning
> > issue, and was in the middle of getting killed. However, that was
> > being slowed down due to the fact that load balance was iterating all
>
> Does it mean that it was progressing but not as fast as you would like
It was a hard lockup, so it's more than just "not fast enough".
Indeed it was progressing, but not at a rate sufficient to avoid
lockup.
> > these threads and bouncing the rq lock (and making no progress due to
> > ALL_PINNED). Before this patch, load balance would quit after hitting
> > loop_max.
> >
> > Even ignoring that specific instance, it seems pretty easy for this
> > patch to cause a softlockup due to a buggy or malicious process.
>
> The fact that the rq is released regularly should prevent a
> softlockup.
That doesn't prevent a softlockup; kernel is stuck looping over a long
list of tasks for too long, regardless of whether it is releasing and
re-acquiring the rq locks.
Note also that load balance can come from softirq in a context where
we have IRQ disabled, which can lead to hard lockup as well.
> And we could even fasten can_migrate() which does a lot of
> useless stuff for task affined to 1 cpu.
That seems like a useful optimization, but not really relevant? It
doesn't matter how small we make the constant factor, we still have an
O(n) operation in kernel mode here.
> > For the tradeoff you were trying to make in this patch (spend more
> > time searching in the hopes that there's something migratable further
> > in the list), perhaps it would be better to adjust
> > sysctl.sched_nr_migrate instead of baking this into the kernel?
>
> That could be a solution but this increases the iterations for all
> cases including those which are more time consuming to sort out and
> the number of tasks that you will migrate in one lb. The latter is the
> one which consumes time
Is is really that bad? loop_max will be unchanged for most cases since
it gets min'd with nr_running anyway. And, even if loop_max ends up
larger in some other instances, we still terminate the iteration after
fixing up env->imbalance (granted, we'll migrate more tasks to achieve
a better balance with a larger loop_max, which I think is your point).
Another idea then: what about separating the number of tasks we can
move from the number of tasks we can search? You effectively want to
keep the number of tasks that can be migrated small (nr_migrate), but
be able to search deeper in the list for things to pull (a larger
search_depth).
- Josh