Re: [RFC PATCH] sched/fair: Make tg->load_avg per node
From: Aaron Lu
Date: Sat Apr 22 2023 - 02:21:50 EST
On Sat, Apr 22, 2023 at 12:01:59PM +0800, Chen Yu wrote:
> On 2023-03-27 at 13:39:55 +0800, Aaron Lu wrote:
> >
> > The reason why only cpus of one node has bigger overhead is: task_group
> > is allocated on demand from a slab and whichever cpu happens to do the
> > allocation, the allocated tg will be located on that node and accessing
> > to tg->load_avg will have a lower cost for cpus on the same node and
> > a higer cost for cpus of the remote node.
> [...]
> > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > {
> > long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > + int node = cpu_to_node(cfs_rq->rq->cpu);
> >
> > /*
> > * No need to update load_avg for root_task_group as it is not used.
> > @@ -3616,7 +3617,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > return;
> >
> > if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > - atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > + atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);
> When entered enqueue_entity(cfs_rq, se) -> update_tg_load_avg(cfs_rq)
> the cfs_rq->rq->cpu is not necessary the current cpu, so the node returned by
> cpu_to_node(cfs_rq->rq->cpu) is not necessary the current node as the current
> CPU, would atomic_add still introduce cross-node overhead due to remote access
> to cfs_rq->tg->node_info[node]->load_avg in this case, or do I miss something?
That's good point.
The chance cpu being different is high, but the node being different is
pretty low though. The wakeup path will not do cross LLC activation with
TTWU_QUEUE, but the load balance path does make it possible to dequeue
a task that is on a remote node and I think that's the only path where
the two cpus(current vs cfs_rq->rq->cpu) can be on different nodes.
A quick test shows during a 5s window of running hackbench on a VM, the
number of times when the two nodes derived from current and cfs_rq->rq->cpu
being different are less than 100 while being equal are 992548.
I'll switch to using smp_processor_id() in next posting since it's the
right thing to do, thanks for spotting this!