Re: [RFC PATCH 0/5] introduce sched-idle balancing

From: Abel Wu
Date: Fri Feb 25 2022 - 08:20:17 EST


Hi Mel, thanks a lot for your review!

On 2/25/22 6:16 PM, Mel Gorman Wrote:
On Fri, Feb 25, 2022 at 04:15:06PM +0800, Abel Wu wrote:
As the overloaded mask may be updated on each idle, it could be a
significant source of cache misses between CPUs sharing the domain for
workloads that rapidly idle so there should be data on whether cache misses
are increased heavily. It also potentially delays the CPU reaching idle
but it may not be by much.

Yes, that's why I cached overloaded status in rq->overloaded. So in
this case of short running tasks, when cpus rapidly/frequently go
idle, the cpumask/counter are not actually updated if the cached
status is already 0 (not overloaded).


Which is a good idea in some respects. It tries to limit the number of
updates and treats it as a boolean but it's probably prone to races.

The filter may be out of date. It takes up to one tick to detect
overloaded and the filter to have a positive impact. As a CPU is not
guaranteed to enter idle if there is at least one CPU-bound task, it may
also be up to 1 tick before the mask is cleared. I'm not sure this is a
serious problem though as SIS would not pick the CPU with the CPU-bound
task anyway.

Yes, it can be out of date, but increasing the accuracy means more
frequent update which would introduce cache issues you mentioned
above. Rate limit the updating to tick at the LLC basis might be an
acceptable tradeoff I presume.


At minimum, the filter should be split out and considered first as it
is the most likely reason why a performance difference was measured. It
has some oddities like why nr_overloaded is really a boolean and as
it's under rq lock, it's not clear why it's atomic. The changelog
would ideally contain some comment on the impact to cache misses
if any and some sort of proof that SIS search depth is reduced which
https://lore.kernel.org/lkml/20210726102247.21437-2-mgorman@xxxxxxxxxxxxxxxxxxx/
may be some help.

At that point, compare the idle task balancing on top to isolate how
much it improves things if any and identify why existing balancing is
insufficient. Split out the can_migrate_task change beforehand in case it
is the main source of difference as opposed to the new balancing mechanism.


The nr_overloaded sits in shared domain structure thus shared in
LLC domain and needs to be atomic_t, while rq->overloaded is a
boolean updated under rq lock. Maybe the naming can cause some
confusion, please lighten me up if you have better idea :)


The naming doesn't help because it's not really "the number of overloaded
rq's". atomic_t would be slightly safer against parallel updates but
it's race prone. I didn't think about it deeply but I suspect that two
separate rq's could disagree on what the boolean value should be if one rq
is overloaded, the other is not and they are updating via the idle path at
the same time. This probably can happen because the locking is rq based
and the cpumask is shared. On the flip side, making it an accurate count
would result in more updates and incur cache misses as well as probably
needing a cpumask check instead of a nr_overloaded comparison to determine
if the rq is already accounted for so it costs more. You are very likely
trading accuracy versus cost of update.

The boolean value (rq->overloaded) is accessed under rq lock, and almost
accessed by its own rq except the very rare case in sched_idle_balance()
where a double check failed on cfs_rq_overloaded(). So this value should
be accurate and has good data locality.

But as you said, the nr_overloaded and cpu mask are race prone in the
following pattern in my patches:

if (nr_overloaded > 0)
/* nr_overloaded can be zero now */
read(overloaded_mask);

since the mask is accessed without rq locked, the cost might not be too
much. This is quite similar with the idle_cpu() usage in SIS I guess.


Whichever choice you make, add comments on the pros/cons and describe
the limitation of either approach. e.g. if overloaded is effectively a
boolean, describe in a comment the limitations.

OK, will do.


And yes, I agree it would be nice if test result on SIS search
depth can be shown, and I actually did the test, but the result
didn't show a reduction in depth due to sched-idle balancing
will also consume sched-idle/idle cpus. I will apply your patch
and make some further tests on that, thanks.


Just remember to use the patch to measure changes in SIS depth but
performance figures should not include the patch as the schedstat
overhead distorts results.

Yes, agreed.


Also place the filter first and do any measurements of any change to
balancing versus the filter. I'm suggesting placing the filter first
because it's less controversial than a new balancer. Just be aware that
the filter alone is not a guarantee of merging as there have been a few
approaches to filtering and so far all of them had downsides on either cost

Yes, understood. I will adjust the patches as you suggested and send v2
together with more tests next week.

or accuracy. IIRC the only active approach to reducing search cost in SIS
is https://lore.kernel.org/all/20220207034013.599214-1-yu.c.chen@xxxxxxxxx/
and it's likely to get a new version due to
https://lore.kernel.org/all/20220207135253.GF23216@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/.
It also updates sched_domain_shared but with a single boolean instead of
an atomic+cpumask.


Chen Yu's patch disables idle cpu searching in SIS when the LLC domain
is overloaded (that is 85% capacity usage) and Peter suggested him use
this metric to replace/improve SIS_PROP feature to make search depth
varying gently.

I don't think either of the two approaches conflict with mine, as they
are to reduce the effort of searching when system is busy and cpus are
not likely to be idle, and mine is to consume sched-idle/idle cpus by
themselves by pulling non-idle tasks from overloaded rqs so there will
be fewer sched-idle/idle cpus.

Thanks and best regards,
Abel