Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

From: Aaron Lu
Date: Tue Mar 28 2023 - 02:43:11 EST


Hi Yu,

Thanks for taking a look.

On Mon, Mar 27, 2023 at 10:45:56PM +0800, Chen Yu wrote:
> On 2023-03-27 at 13:39:55 +0800, Aaron Lu wrote:
> > When using sysbench to benchmark Postgres in a single docker instance
> > with sysbench's nr_threads set to nr_cpu, it is observed there are times
> > update_cfs_group() and update_load_avg() shows noticeable overhead on
> > cpus of one node of a 2sockets/112core/224cpu Intel Sapphire Rapids:
> >
> > 10.01% 9.86% [kernel.vmlinux] [k] update_cfs_group
> > 7.84% 7.43% [kernel.vmlinux] [k] update_load_avg
> >
> > While cpus of the other node normally sees a lower cycle percent:
> >
> > 4.46% 4.36% [kernel.vmlinux] [k] update_cfs_group
> > 4.02% 3.40% [kernel.vmlinux] [k] update_load_avg
> >
> > Annotate shows the cycles are mostly spent on accessing tg->load_avg
> > with update_load_avg() being the write side and update_cfs_group() being
> > the read side.
> >
> > The reason why only cpus of one node have the bigger overhead is:
> > task_group is allocated on demand from a slab, so the tg ends up on
> > the node of whichever cpu happens to do the allocation. Accessing
> > tg->load_avg then has a lower cost for cpus on the same node and
> > a higher cost for cpus of the remote node.
> >
> > Tim Chen told me that PeterZ once mentioned a way to solve a similar
> > problem by making a counter per node so do the same for tg->load_avg.
> > After this change, the worst number I saw during a 5 minutes run from
> > both nodes are:
> >
> > 2.77% 2.11% [kernel.vmlinux] [k] update_load_avg
> > 2.72% 2.59% [kernel.vmlinux] [k] update_cfs_group
> >
> > Another observation of this workload is: it has a lot of wakeup time
> > task migrations and that is the reason why update_load_avg() and
> > update_cfs_group() shows noticeable cost. Running this workload in N
> > instances setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> > task migrations on wake up time are greatly reduced and the overhead from
> > the two above mentioned functions also dropped a lot. It's not clear to
> > me why running in multiple instances can reduce task migrations on
> > wakeup path yet.
> >
> Looks interesting. When sysbench runs as 1 instance with nr_threads = nr_cpu,
> and when more than 1 instance of sysbench is launched with nr_threads set
> to 1/N * nr_cpu, do both cases have similar CPU utilization? Currently the
> task wakeup inhibits migration wakeup if the system is overloaded.

I think this is a good point. I did notice during a run that when CPU
util goes up, the number of migrations drops. And the 4-instance setup
generally has higher CPU util than the 1-instance setup.

I should also add that in the vanilla kernel, if tg is allocated on node 0,
then task migrations happening on the remote node are the deciding factor
in the increased cost of update_cfs_group() and update_load_avg(), because
the remote node pays a higher cost to access tg->load_avg.
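
To illustrate the per-node counter idea outside the kernel, here is a
minimal user-space sketch. It is not the patch itself: NR_NODES,
struct node_counter, struct group_load and the group_load_*() helpers are
hypothetical names, and the 64-byte alignment is an assumed cache line
size. The sketch only keeps each node's counter on its own cache line;
the patch additionally places each node's data in that node's memory with
kzalloc_node(). Summing all nodes on the read side is an assumption made
for the example, not a description of what the patch does.

```c
#include <stdatomic.h>

#define NR_NODES 2	/* hypothetical: two NUMA nodes */

/* One counter per node, each on its own (assumed 64-byte) cache line. */
struct node_counter {
	_Alignas(64) atomic_long load_avg;
};

struct group_load {
	struct node_counter node[NR_NODES];
};

/* Write side: a cpu only touches the counter of its own node. */
static void group_load_add(struct group_load *gl, int node, long delta)
{
	atomic_fetch_add_explicit(&gl->node[node].load_avg, delta,
				  memory_order_relaxed);
}

/* Read side (assumed): add up the counters of all nodes. */
static long group_load_sum(struct group_load *gl)
{
	long sum = 0;
	int i;

	for (i = 0; i < NR_NODES; i++)
		sum += atomic_load_explicit(&gl->node[i].load_avg,
					    memory_order_relaxed);
	return sum;
}
```

With this layout, writers on different nodes no longer contend on the
same cache line, at the cost of a reader having to visit more than one
counter if it wants a group-wide value.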

> [...]
> > struct task_group *sched_create_group(struct task_group *parent)
> > {
> > + size_t size = sizeof(struct task_group);
> > + int __maybe_unused i, nodes;
> > struct task_group *tg;
> >
> > - tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
> > +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> > + nodes = num_possible_nodes();
> > + size += nodes * sizeof(void *);
> > + tg = kzalloc(size, GFP_KERNEL);
> > + if (!tg)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + for_each_node(i) {
> > + tg->node_info[i] = kzalloc_node(sizeof(struct tg_node_info), GFP_KERNEL, i);
> > + if (!tg->node_info[i])
> > + return ERR_PTR(-ENOMEM);
> Do we need to free tg above in case of memory leak?

Good catch, will fix this in the next posting, thanks!
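
For reference, the unwind could look roughly like the user-space sketch
below. This is only an illustration of the error path, not the code of
the next posting: calloc()/free() stand in for kzalloc()/kfree(),
NR_NODES replaces num_possible_nodes(), and create_group(),
alloc_zeroed(), release(), fail_at_node and live_allocs are hypothetical
names added so that a failure can be simulated. It assumes the per-node
entries allocated before the failure also have to be freed, in addition
to tg.

```c
#include <stdlib.h>

#define NR_NODES 2	/* hypothetical stand-in for num_possible_nodes() */

struct tg_node_info {
	long load_avg;
};

struct task_group {
	int placeholder;			/* stands in for the real fields */
	struct tg_node_info *node_info[];	/* one pointer per node */
};

static int fail_at_node = -1;	/* hypothetical: node whose allocation fails */
static int live_allocs;		/* hypothetical: number of unfreed allocations */

static void *alloc_zeroed(size_t size)
{
	void *p = calloc(1, size);

	if (p)
		live_allocs++;
	return p;
}

static void release(void *p)
{
	if (p) {
		free(p);
		live_allocs--;
	}
}

static struct task_group *create_group(void)
{
	size_t size = sizeof(struct task_group) + NR_NODES * sizeof(void *);
	struct task_group *tg = alloc_zeroed(size);
	int i;

	if (!tg)
		return NULL;

	for (i = 0; i < NR_NODES; i++) {
		tg->node_info[i] = (i == fail_at_node) ?
			NULL : alloc_zeroed(sizeof(struct tg_node_info));
		if (!tg->node_info[i])
			goto err;
	}
	return tg;

err:
	/* Free the entries allocated before the failure, then tg itself. */
	while (i-- > 0)
		release(tg->node_info[i]);
	release(tg);
	return NULL;
}
```

Setting fail_at_node to a valid node index makes create_group() return
NULL with live_allocs back at 0, which is the property the fix needs.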