Re: [PATCH v2 0/4] per-cgroup numa suite

From: Michael Wang
Date: Wed Jul 24 2019 - 22:33:19 EST


Hi, Peter

Now we have all this stuff in the cpu cgroup; with the new statistics
folks should be able to estimate their per-cgroup workloads on
numa platforms, and numa group + cling should help address the
issue when their workloads can't be settled on one node.

What do you think about this version? :-)

Regards,
Michael Wang

On 2019/7/16 11:38 PM, Michael Wang wrote:
> While torture-testing numa, we found problems like:
>
> * missing per-cgroup information about the per-node execution status
> * missing per-cgroup information about the numa locality
>
> That is, when we have a cpu cgroup running with a bunch of tasks, there
> is no good way to tell how its tasks are dealing with numa.
>
> The first two patches try to supply the missing pieces, but more
> problems appeared once we began monitoring this status:
>
> * tasks do not always run on the preferred numa node
> * tasks from the same cgroup run on different nodes
>
> The task numa group handler will always check whether tasks are sharing
> pages and try to pack them into a single numa group, so they will have a
> chance to settle down on the same node, but this fails in some cases:
>
> * workloads share page caches rather than shared mappings
> * workloads get too many wakeups across nodes
>
> Since page caches are not traced by numa balancing, there is no way to
> detect such a relationship, and when there are too many wakeups, a task
> will be dragged from the preferred node and then migrated back by numa
> balancing, repeatedly.
>
> Here the third patch tries to address the first issue: we can now give
> the kernel a hint about the relationship between tasks and pack them
> into a single numa group.
>
> And the fourth patch introduces numa cling, which tries to address the
> wakeup issue: we now try to make a task stay on the preferred node on
> wakeup in the fast path. To address the risk of imbalance, we monitor
> the numa migration failure ratio and pause numa cling when it reaches
> the specified degree.
>
> Since v1:
> * moved statistics from the memory cgroup into the cpu cgroup
> * statistics are now accounted in a hierarchical way
> * locality is now accounted into 8 equal regions
> * numa cling no longer overrides select_idle_sibling; instead we
> prevent numa swap migration for tasks that cling to the dst-node, and
> also prevent wake affine from dragging away tasks that already cling
> to the prev-cpu
> * other refinements to comments and names
>
> Michael Wang (4):
> v2 numa: introduce per-cgroup numa balancing locality statistic
> v2 numa: append per-node execution time in cpu.numa_stat
> v2 numa: introduce numa group per task group
> v4 numa: introduce numa cling feature
>
> include/linux/sched.h | 8 +-
> include/linux/sched/sysctl.h | 3 +
> kernel/sched/core.c | 85 ++++++++
> kernel/sched/debug.c | 7 +
> kernel/sched/fair.c | 510 ++++++++++++++++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 41 ++++
> kernel/sysctl.c | 9 +
> 7 files changed, 651 insertions(+), 12 deletions(-)
>