Re: [RFC PATCH 0/5] introduce sched-idle balancing
From: Abel Wu
Date: Fri Feb 25 2022 - 03:15:19 EST
Hi Mel,

On 2/25/22 12:47 AM, Mel Gorman wrote:
> On Thu, Feb 17, 2022 at 11:43:56PM +0800, Abel Wu wrote:
>> Current load balancing is mainly based on cpu capacity and task
>> util, which makes sense from the POV of overall throughput. Still,
>> some improvement can be made by reducing the number of overloaded
>> cfs rqs when sched-idle or idle rqs exist.
>>
>> A CFS runqueue is considered overloaded when there is more than one
>> pullable non-idle task on it (since sched-idle cpus are treated as
>> idle cpus). Idle tasks, i.e. tasks either assigned the SCHED_IDLE
>> policy or placed under idle cgroups, are counted towards
>> rq->cfs.idle_h_nr_running.
>
> It's not clear how your tests evaluated the balancing of SCHED_IDLE
> tasks versus the existing idle balancing and isolated that impact. I
> suspect the tests may have primarily measured the effect of the SIS
> filter.

Sched-idle balancing doesn't really care about the idle tasks. It
tries to improve the non-idle tasks' performance by spreading them
out to make full use of cpu capacity.

I will run individual tests on the SIS filter and the sched-idle
balancer separately, and keep you informed.
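
To illustrate, the "overloaded" test above boils down to something
like the sketch below (the helper name is made up, and it ignores the
cpu-affinity checks that "pullable" also implies):

	/* More than one non-idle task: at least one can be pulled. */
	static inline bool rq_overloaded(struct rq *rq)
	{
		return rq->cfs.h_nr_running -
		       rq->cfs.idle_h_nr_running > 1;
	}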

>> So in short, the goal of sched-idle balancing is to let the
>> *non-idle tasks* make full use of cpu resources. To achieve that,
>> we mainly do two things:
>>
>>  - pull non-idle tasks to sched-idle or idle rqs from the
>>    overloaded ones, and
>>
>>  - prevent pulling the last non-idle task off an rq
>>
>> The mask of overloaded cpus is updated in the periodic tick and in
>> the idle path, on a per-LLC-domain basis. This cpumask will also be
>> used in SIS as a filter, improving idle cpu searching.
>
> As the overloaded mask may be updated on each idle, it could be a
> significant source of cache misses between CPUs sharing the domain
> for workloads that rapidly idle, so there should be data on whether
> cache misses increase heavily. It also potentially delays the CPU
> reaching idle, but probably not by much.

Yes, that's why I cached the overloaded status in rq->overloaded. So
in this case of short-running tasks, when cpus rapidly/frequently go
idle, the cpumask/counter are not actually updated if the cached
status is already 0 (not overloaded).
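
Roughly, the update path is the following simplified sketch (the
function name and the sdo_mask() accessor are illustrative, not the
exact patch code):

	static void update_overload_status(struct rq *rq)
	{
		struct sched_domain_shared *sds;
		bool overloaded = rq->cfs.h_nr_running -
				  rq->cfs.idle_h_nr_running > 1;

		/* Cached status unchanged: skip the shared-state write. */
		if (overloaded == rq->overloaded)
			return;

		rcu_read_lock();
		sds = rcu_dereference(per_cpu(sd_llc_shared, cpu_of(rq)));
		if (sds) {
			if (overloaded) {
				cpumask_set_cpu(cpu_of(rq), sdo_mask(sds));
				atomic_inc(&sds->nr_overloaded);
			} else {
				cpumask_clear_cpu(cpu_of(rq), sdo_mask(sds));
				atomic_dec(&sds->nr_overloaded);
			}
			rq->overloaded = overloaded;
		}
		rcu_read_unlock();
	}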

> The filter may be out of date. It takes up to one tick to detect an
> overloaded rq and for the filter to have a positive impact. As a CPU
> is not guaranteed to enter idle if there is at least one CPU-bound
> task, it may also be up to one tick before the mask is cleared. I'm
> not sure this is a serious problem though, as SIS would not pick the
> CPU with the CPU-bound task anyway.

Yes, it can be out of date, but increasing the accuracy means more
frequent updates, which would introduce the cache issues you
mentioned above. Rate-limiting the updates to the tick, at the LLC
level, might be an acceptable tradeoff, I presume.
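
(With the sketch above, and updates confined to the tick and the idle
path, a cpu whose cached status doesn't change never touches the
shared cpumask/counter; only an actual status transition pays one
cpumask write plus one atomic op.)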

> At minimum, the filter should be split out and considered first, as
> it is the most likely reason why a performance difference was
> measured. It has some oddities, like why nr_overloaded is
> effectively a boolean and, as it's updated under the rq lock, why it
> needs to be atomic. The changelog would ideally contain some comment
> on the impact on cache misses, if any, and some sort of proof that
> SIS search depth is reduced, for which
> https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@xxxxxxxxxxxxxxxxxxx/
> may be of some help.
>
> At that point, compare the idle task balancing on top to isolate how
> much it improves things, if any, and identify why the existing
> balancing is insufficient. Split out the can_migrate_task change
> beforehand in case it is the main source of difference, as opposed
> to the new balancing mechanism.

The nr_overloaded counter sits in the shared domain structure, and is
thus shared across the LLC domain, so it needs to be atomic_t, while
rq->overloaded is a boolean updated under the rq lock. Maybe the
naming causes some confusion; please enlighten me if you have a
better idea :)
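
For reference, the split looks roughly like this (nr_overloaded and
rq->overloaded come from the series; ref, nr_busy_cpus and
has_idle_cores are the existing fields, and the exact layout is
illustrative):

	struct sched_domain_shared {
		atomic_t	ref;
		atomic_t	nr_busy_cpus;
		int		has_idle_cores;
		atomic_t	nr_overloaded;	/* LLC-wide, hence atomic_t */
		/* plus an LLC-wide cpumask of overloaded cpus */
	};

	/* in struct rq, written only under rq->lock: */
	int		overloaded;	/* cached per-rq status */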

And yes, I agree it would be nice to show test results on SIS search
depth. I actually ran that test, but it didn't show a reduction in
depth, because sched-idle balancing will also consume sched-idle/idle
cpus. I will apply your patch and run some further tests on that,
thanks.

Best Regards,
Abel