[RFC PATCH 0/5] NUMA Balancer Suite

From: Michael Wang
Date: Sun Apr 21 2019 - 22:11:52 EST


The NUMA Balancing feature always tries to move the pages of a task to the
node where the task executes most, but it still has issues:

* page cache can't be handled
* no cgroup level balancing

Suppose we have a box with 4 CPUs and two cgroups, A and B, each running
4 tasks. The following scenario can easily be observed:

        NODE0         |         NODE1
                      |
  CPU0       CPU1     |   CPU2       CPU3
  task_A0    task_A1  |   task_A2    task_A3
  task_B0    task_B1  |   task_B2    task_B3

Memory consumption is usually equal on each node when the tasks behave
similarly.

In this case NUMA balancing tries to move the pages of task_A0,1 and
task_B0,1 to node 0, and the pages of task_A2,3 and task_B2,3 to node 1, but
the page cache ends up located randomly, depending on the node of the CPU
that first read or wrote the file.
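
The first-touch behaviour described above can be modelled with a small
userspace sketch. It is only an illustration, not code from the patches: the
names example_cpu_to_node(), struct cache_page and touch_page() are
hypothetical, and the CPU layout is the 4-CPU, 2-node box from the diagram.

```c
#include <assert.h>

#define EXAMPLE_NR_CPUS       4
#define EXAMPLE_CPUS_PER_NODE 2

/* Hypothetical mapping for the example box: CPU0,1 -> node 0, CPU2,3 -> node 1. */
int example_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < EXAMPLE_NR_CPUS);
	return cpu / EXAMPLE_CPUS_PER_NODE;
}

/* Hypothetical model of one page cache page; node is -1 until first touched. */
struct cache_page {
	int node;
};

/*
 * Model of first-touch placement: the page lands on the node of the CPU
 * that first reads or writes it, and later accesses do not move it.
 */
void touch_page(struct cache_page *page, int cpu)
{
	if (page->node < 0)
		page->node = example_cpu_to_node(cpu);
}
```

In this model a page first touched from CPU3 stays on node 1 even if the task
later runs on CPU0, which is the locality loss discussed here.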

Now consider another scenario:

        NODE0         |         NODE1
                      |
  CPU0       CPU1     |   CPU2       CPU3
  task_A0    task_A1  |   task_B0    task_B1
  task_A2    task_A3  |   task_B2    task_B3

By swapping the CPU and memory resources of task_A2,3 and task_B0,1, the
workloads of cgroup A are now all on node 0 and those of cgroup B all on
node 1. Resource consumption is the same, but related tasks can share a
closer CPU cache, while the page cache is still located randomly.

Now what if the workloads generate lots of page cache, and most memory
accesses are page cache writes?

Page cache generated by task_A0 on NODE1 won't follow it to NODE0, but if
task_A0 was already on NODE0 before it read or wrote the files, the cache
will be there. So how do we make sure this happens?

Usually we can solve this problem by binding workloads to a single node: if
cgroup A is bound to CPU0,1, then all the cache it generates will be on
NODE0, and the NUMA bonus will be at its maximum.

However, this requires very careful administration of specific workloads.
In our case, if the CPU requirement of A and B varies from 0% to 400%, then
binding them to a single node would be a bad idea.
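
To make the arithmetic concrete: a node in the example box has 2 CPUs, so a
cgroup bound to it can use at most 200%. The helper below is a hypothetical
illustration (bound_shortfall_pct() is not part of the patches) of how much
demand goes unserved under such a binding.

```c
#include <assert.h>

/*
 * Hypothetical helper: CPU demand, in percent of one CPU, that cannot be
 * served when a cgroup is bound to a node with cpus_per_node CPUs.
 */
unsigned int bound_shortfall_pct(unsigned int demand_pct,
				 unsigned int cpus_per_node)
{
	unsigned int capacity_pct;

	assert(cpus_per_node > 0);
	capacity_pct = cpus_per_node * 100;

	return demand_pct > capacity_pct ? demand_pct - capacity_pct : 0;
}
```

With a peak demand of 400% and 2 CPUs per node, the shortfall is 200%, i.e.
half of the peak demand cannot be served while the binding is in place.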

So what we need is a way to detect memory topology at the cgroup level, and
to try to migrate CPU/memory resources to the node holding most of the
cache, as long as that node has plenty of resources.
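
The policy stated above can be sketched in userspace C as follows. This is
a minimal illustration under stated assumptions, not the algorithm in
numa_balancer.c: struct node_stat, its fields, and pick_preferred_node() are
hypothetical names, and "plenty of resources" is reduced to a single spare
CPU threshold.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node snapshot for one cgroup; not a kernel structure. */
struct node_stat {
	unsigned long cache_pages;	/* cgroup's page cache pages on this node */
	unsigned int free_cpu_pct;	/* spare CPU on this node, percent of one CPU */
};

/*
 * Return the index of the node that holds the most page cache among the
 * nodes with at least need_cpu_pct spare CPU, or -1 if no node qualifies
 * (meaning no preferred node should be set).
 */
int pick_preferred_node(const struct node_stat *stats, size_t nr_nodes,
			unsigned int need_cpu_pct)
{
	int best = -1;
	size_t i;

	assert(stats != NULL || nr_nodes == 0);

	for (i = 0; i < nr_nodes; i++) {
		if (stats[i].free_cpu_pct < need_cpu_pct)
			continue;	/* not enough spare resources here */
		if (best < 0 || stats[i].cache_pages > stats[best].cache_pages)
			best = (int)i;
	}

	return best;
}
```

Returning -1 when no node has enough spare CPU keeps the sketch from
preferring a node that is already busy, matching the "as long as" condition
in the text.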

This patch set introduces:
* advanced per-cgroup NUMA statistics
* a NUMA preferred node feature
* the NUMA Balancer module

Together these help achieve easy and flexible NUMA resource assignment, to
gain as much NUMA bonus as possible.

Michael Wang (5):
numa: introduce per-cgroup numa balancing locality statistic
numa: append per-node execution info in memory.numa_stat
numa: introduce per-cgroup preferred numa node
numa: introduce numa balancer infrastructure
numa: numa balancer

drivers/Makefile | 1 +
drivers/numa/Makefile | 1 +
drivers/numa/numa_balancer.c | 715 +++++++++++++++++++++++++++++++++++++++++++
include/linux/memcontrol.h | 99 ++++++
include/linux/sched.h | 9 +-
kernel/sched/debug.c | 8 +
kernel/sched/fair.c | 41 +++
mm/huge_memory.c | 7 +-
mm/memcontrol.c | 246 +++++++++++++++
mm/memory.c | 9 +-
mm/mempolicy.c | 4 +
11 files changed, 1133 insertions(+), 7 deletions(-)
create mode 100644 drivers/numa/Makefile
create mode 100644 drivers/numa/numa_balancer.c

--
2.14.4.44.g2045bb6