Re: [RFC PATCH v2 0/4] Introduce group balancer

From: Tianchen Ding
Date: Wed Mar 09 2022 - 03:31:06 EST


On 2022/3/9 01:13, Tejun Heo wrote:
> Hello,
>
> On Tue, Mar 08, 2022 at 05:26:25PM +0800, Tianchen Ding wrote:
>> Modern platforms are growing fast in CPU count. To make better use
>> of CPU resources, multiple apps are starting to share the CPUs.
>>
>> What we need is a way to ease contention in shared mode and make
>> groups as exclusive as possible, to gain both performance and
>> resource efficiency.
>>
>> The main idea of the group balancer is to fulfill this requirement
>> by balancing groups of tasks among groups of CPUs; consider it a
>> dynamic semi-exclusive mode. A task triggers work to settle its
>> group into a proper partition (the one with minimum predicted
>> load), then tries to migrate itself into that partition, so that
>> groups gradually settle into the most exclusive partitions.
>>
>> GB can be seen as an optimization policy built on top of load
>> balancing: it obeys the main idea of load balancing and makes
>> adjustments on that basis.
>>
>> Our test on an ARM64 platform with 128 CPUs shows that the
>> throughput of sysbench memory improves by about 25%, and that of
>> redis-benchmark improves by up to about 10%.
>
> The motivation makes sense to me, but I'm not sure this is the right
> way to architect it. We already have the framework to do all of this:
> the sched domains and the load balancer. Architecturally, what the
> suggested patchset does is build a separate load balancer on top of
> cpuset after using cpuset to disable the existing load balancer,
> which is rather obviously convoluted.


"the sched domains and the load balancer" you mentioned are the ways to "balance" tasks on each domains. However, this patchset aims to "group" them together to win hot cache and less competition, which is different from load balancer. See commit log of the patch 3/4 and this link:
https://lore.kernel.org/all/11d4c86a-40ef-6ce5-6d08-e9d0bc9b512a@xxxxxxxxxxxxxxxxx/
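
To make the contrast concrete, here is a minimal userspace sketch of
the "settle the group on the minimum-predicted-load partition" idea.
All names (struct partition, gb_pick_partition, the load numbers) are
hypothetical, made up for illustration; this is not the patch code:

#include <stdio.h>

struct partition {
        int id;
        unsigned long load;       /* current aggregate load */
        unsigned long group_load; /* load our group already placed here */
};

/*
 * Predicted load if the group settles on p: the group's own
 * contribution moves with it, so subtract it out.
 */
static unsigned long predicted_load(const struct partition *p)
{
        return p->load - p->group_load;
}

static const struct partition *
gb_pick_partition(const struct partition *parts, int n)
{
        const struct partition *best = &parts[0];

        for (int i = 1; i < n; i++)
                if (predicted_load(&parts[i]) < predicted_load(best))
                        best = &parts[i];
        return best;
}

int main(void)
{
        const struct partition parts[] = {
                { .id = 0, .load = 70, .group_load = 10 },
                { .id = 1, .load = 40, .group_load = 0  },
        };

        /* The group settles on partition 1 (predicted 40 < 60). */
        printf("settle group on partition %d\n",
               gb_pick_partition(parts, 2)->id);
        return 0;
}

Tasks of the group then prefer CPUs inside that partition, which is
where the cache hotness and reduced competition come from.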

> * AFAICS, none of what the suggested code does is all that complicated
>   or needs a lot of input from userspace. It should be possible to
>   parametrize the existing load balancer to behave better.


The group balancer mainly needs two inputs from userspace: CPU
partition info and cgroup info.
CPU partition info does need user input (and may be a bit complicated).
In return, users are free to choose the division method (it can follow
NUMA nodes, clusters, caches, etc.).
Cgroup info doesn't need extra input; it's naturally configured.
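
For example (a purely hypothetical format, not the actual interface of
this patchset), the partition info could be a single string of CPU
ranges, one per NUMA node, cluster, or LLC, that the kernel splits up:

#include <stdio.h>
#include <string.h>

int main(void)
{
        /* e.g. one range per NUMA node on a 128-CPU box */
        char info[] = "0-31;32-63;64-95;96-127";
        int id = 0;

        for (char *range = strtok(info, ";"); range;
             range = strtok(NULL, ";"))
                printf("partition %d: cpus %s\n", id++, range);
        return 0;
}

The point is only that the division is the user's choice; the kernel
does not have to guess the right granularity.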

It does parametrize the existing load balancer to behave better. The
group balancer is a kind of optimization policy: it obeys the basic
policy (load balancing) and improves on it.
The relationship between the load balancer and the group balancer is
explained in detail at the link above.
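
Roughly, the "obey, then adjust" relationship could be sketched like
this (gb_adjust and the threshold are hypothetical, for illustration
only; the real patch works inside the scheduler's migration paths):

#include <stdio.h>

struct cpu {
        int id;
        unsigned long load;
};

#define GB_OVERLOAD 80 /* assumed overload threshold */

/*
 * Soft preference: take the CPU from the group's settled partition if
 * it is healthy, otherwise defer to the load balancer's own choice.
 */
static int gb_adjust(const struct cpu *lb_pick, const struct cpu *settled)
{
        if (settled->load < GB_OVERLOAD)
                return settled->id;
        return lb_pick->id;
}

int main(void)
{
        const struct cpu lb = { .id = 5,  .load = 30 };
        const struct cpu st = { .id = 64, .load = 95 };

        /* The settled CPU is overloaded, so we fall back to CPU 5. */
        printf("run on cpu %d\n", gb_adjust(&lb, &st));
        return 0;
}

So when the preference would hurt the overall balance, load balancing
always wins.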

> * If, for some reason, you need more customizable behavior in terms
>   of cpu allocation, which is what cpuset is for, maybe it'd be better
>   to build the load balancer in userspace. That'd fit way better with
>   how cgroup is used in general, and with threaded cgroups it should
>   fit nicely with everything else.


We put the group balancer in kernel space because this new policy does
not depend on userspace apps; it's a "general" feature.
Doing a "dynamic cpuset" in userspace may also introduce performance
issues, since it may need to bind and unbind different cpusets many
times, and it is too strict compared with our "soft bind".