Re: [PATCH] sched/numa: advanced per-cgroup numa statistic

From: Peter Zijlstra
Date: Mon Oct 28 2019 - 09:02:53 EST


On Thu, Oct 24, 2019 at 11:08:01AM +0800, Michael Wang wrote:
> Currently there is no good approach to monitoring per-cgroup
> numa efficiency, which can be a problem especially when groups
> are sharing CPUs: it's impossible to tell which one caused the
> remote-memory accesses by reading hardware counters, since multiple
> workloads could be sharing the same CPU, which makes it painful
> when one wants to find the root cause and fix the issue.
>
> In order to address this, we introduce new per-cgroup statistics
> for numa:
> * the numa locality, to reflect the numa balancing efficiency
> * the numa execution time on each node
>
> The task locality is the local page access ratio traced on numa
> balancing page faults, and the group locality is the histogram of
> task execution time, sectioned by locality into 8 regions.
>
> For example the new entry 'cpu.numa_stat' shows:
> locality 15393 21259 13023 44461 21247 17012 28496 145402
> exectime 311900 407166
>
> Here we know the workloads executed for 311900ms on node_0 and 407166ms
> on node_1; tasks with locality around 0~12% executed for 15393 ms, and
> tasks with locality around 88~100% executed for 145402 ms, which implies
> that most of the memory accesses are local for the workloads of this
> group.
>
> By monitoring the new statistics, we will be able to know the numa
> efficiency of each per-cgroup workload on the machine, whether they
> are sharing CPUs or not, and to find out which one introduced most
> of the remote accesses.
>
> Besides, the per-node memory topology from 'memory.numa_stat' becomes
> more useful when we have the per-node execution time; workloads that
> always execute on node_0 while their memory is all on node_1 are
> usually a bad case.
>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Michal Koutný <mkoutny@xxxxxxxx>
> Signed-off-by: Michael Wang <yun.wang@xxxxxxxxxxxxxxxxx>
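
For what it's worth, interpreting that histogram from userspace is
straightforward; a small illustrative sketch (not part of the patch,
names made up here) that condenses the 8 buckets from the example above
into a single weighted locality figure:

#include <stdio.h>

int main(void)
{
	/* exectime (ms) per locality bucket, from the example output above */
	unsigned long buckets[8] = {
		15393, 21259, 13023, 44461, 21247, 17012, 28496, 145402
	};
	unsigned long total = 0;
	double weighted = 0.0;
	int i;

	for (i = 0; i < 8; i++) {
		/* bucket i covers locality [i*12.5%, (i+1)*12.5%); use the midpoint */
		double midpoint = (i + 0.5) * 100.0 / 8.0;

		weighted += buckets[i] * midpoint;
		total += buckets[i];
	}

	if (total)
		printf("approx. locality: %.1f%%\n", weighted / total);

	return 0;
}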

Mel, can you have a peek at this too?


So this is the part I like least:

> +DEFINE_PER_CPU(struct numa_stat, root_numa_stat);
> +
> +int alloc_tg_numa_stat(struct task_group *tg)
> +{
> +	tg->numa_stat = alloc_percpu(struct numa_stat);
> +	if (!tg->numa_stat)
> +		return 0;
> +
> +	return 1;
> +}
> +
> +void free_tg_numa_stat(struct task_group *tg)
> +{
> +	free_percpu(tg->numa_stat);
> +}
> +
> +static void update_tg_numa_stat(struct task_struct *p)
> +{
> +	struct task_group *tg;
> +	unsigned long remote = p->numa_faults_locality[3];
> +	unsigned long local = p->numa_faults_locality[4];
> +	int idx = -1;
> +
> +	/* To be scaled? */
> +	if (remote || local)
> +		idx = NR_NL_INTERVAL * local / (remote + local + 1);
> +
> +	rcu_read_lock();
> +
> +	tg = task_group(p);
> +	while (tg) {
> +		/* skip accounting when there are no fault records */
> +		if (idx != -1)
> +			this_cpu_inc(tg->numa_stat->locality[idx]);
> +
> +		this_cpu_inc(tg->numa_stat->jiffies);
> +
> +		tg = tg->parent;
> +	}
> +
> +	rcu_read_unlock();
> +}

Thing is, we already have a cgroup hierarchy walk in the tick; see
task_tick_fair().

On top of that, you're walking an entirely different set of pointers:
instead of cfs_rq, you're walking tg->parent, which pretty much
guarantees you're adding even more cache misses.

How about you stick those numa_stats in cfs_rq (with cacheline
alignment) and see if you can frob your update loop into the cgroup walk
we already do.
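
Something like the below, perhaps (a completely untested sketch; the
field and helper names here are made up for illustration, not taken
from the patch):

struct numa_stat {
	u64			locality[NR_NL_INTERVAL];
	u64			jiffies;
};

/* inside struct cfs_rq, next to the other per-group state */
#ifdef CONFIG_NUMA_BALANCING
	struct numa_stat	numa_stat ____cacheline_aligned;
#endif

/* called from the for_each_sched_entity() walk in task_tick_fair() */
static void update_cfs_rq_numa_stat(struct cfs_rq *cfs_rq, int idx)
{
	if (idx != -1)
		cfs_rq->numa_stat.locality[idx]++;

	cfs_rq->numa_stat.jiffies++;
}

Since every group already has a cfs_rq per CPU, that would also do away
with the separate percpu allocation, and reading the stat becomes a sum
over the group's cfs_rqs.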