Dave Hansen's patches have been merged into Linus's tree:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
so PMEM can now be hot-plugged as a NUMA node. How to use PMEM as a NUMA
node effectively and efficiently is still worth exploring.
There have been a couple of proposals posted on the mailing list [1] [2] [3].
I posted two versions of a patchset for demoting/promoting memory pages
between DRAM and PMEM before this topic was discussed at LSF/MM 2019
(https://lwn.net/Articles/787418/). I do appreciate all the great suggestions
from the community. This updated version implements most of the feedback from
that discussion; please see the design section below for the details.
Changelog
=========
v2 --> v3:
* Introduced "migrate mode" for node reclaim. Demotion is done only when
"migrate mode" is specified, per Michal Hocko and Mel Gorman.
* Introduced "migrate target" concept for VM per Mel Gorman. The memory nodes
which are under DRAM in the hierarchy (i.e. lower bandwidth, higher latency,
larger capacity and cheaper than DRAM) are considered as "migrate target"
nodes. When "migrate mode" is on, memory reclaim would demote pages to
the "migrate target" nodes.
* Dropped "twice access" promotion patch per Michal Hocko.
* Changed the subject for the patchset to reflect the update.
* Rebased to 5.2-rc1.
v1 --> v2:
* Dropped the default allocation node mask. The memory placement restriction
could be achieved by mempolicy or cpuset.
* Dropped the new mempolicy since its semantics are not that clear yet.
* Dropped PG_Promote flag.
* Defined N_CPU_MEM nodemask for the nodes which have both CPU and memory.
* Extended page_check_references() to implement "twice access" check for
anonymous page in NUMA balancing path.
* Reworked the memory demotion code.
v2: https://lore.kernel.org/linux-mm/1554955019-29472-1-git-send-email-yang.shi@xxxxxxxxxxxxxxxxx/
v1: https://lore.kernel.org/linux-mm/1553316275-21985-1-git-send-email-yang.shi@xxxxxxxxxxxxxxxxx/
Design
======
With the development of new memory technology, a system may have cheaper and
larger memory devices that have higher latency and lower bandwidth than DRAM,
e.g. PMEM. Such a device can be used as persistent storage or as volatile
memory.
It fits into the memory hierarchy as second tier memory. The patchset
explores an approach that uses such memory to improve memory placement.
Basically, the patchset tries to achieve this goal by doing memory
promotion/demotion via NUMA balancing and memory reclaim.
Introduce a new "migrate" mode for node reclaim. When DRAM is under memory
pressure, demote pages to PMEM via the node reclaim path if "migrate" mode is
on. NUMA balancing then promotes a page back to DRAM once the page is
referenced again. Memory pressure on the PMEM node pushes the inactive
pages of PMEM to disk via swap.
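
The flow described above can be read as a small state machine for a single
anonymous page. The sketch below is only a userspace model of that reading,
not code from the patchset: every name in it (page_place_model,
page_event_model, next_place) is hypothetical, and it assumes that a DRAM
page under pressure is swapped when "migrate" mode is off.

```c
#include <stdbool.h>

/* Hypothetical model of where an anonymous page currently lives. */
enum page_place_model { PLACE_DRAM, PLACE_PMEM, PLACE_SWAP };

/* Hypothetical events: pressure on the page's node, or a new reference. */
enum page_event_model { EV_NODE_PRESSURE, EV_REFERENCED };

/*
 * Return the next placement of an inactive anonymous page.
 * Assumption: without "migrate" mode, DRAM pressure leads to swap.
 */
static enum page_place_model next_place(enum page_place_model cur,
					enum page_event_model ev,
					bool migrate_mode)
{
	switch (cur) {
	case PLACE_DRAM:
		if (ev == EV_NODE_PRESSURE)
			return migrate_mode ? PLACE_PMEM : PLACE_SWAP;
		return PLACE_DRAM;
	case PLACE_PMEM:
		/* A reference promotes the page; pressure pushes it to swap. */
		if (ev == EV_REFERENCED)
			return PLACE_DRAM;
		return PLACE_SWAP;
	case PLACE_SWAP:
	default:
		/* Swap-in is not modelled here. */
		return cur;
	}
}
```

With "migrate" mode on, pressure moves a DRAM page to PMEM, a later
reference brings it back, and pressure on PMEM sends it to swap.
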
Introduce "primary" node and "migrate target" node concepts for VM (patch 1/9
and 2/9). A "primary" node is a node which has both CPU and memory. A
"migrate target" node is a cpuless node that sits below DRAM in the memory
hierarchy (e.g. PMEM may be a suitable one, since it has lower bandwidth,
higher latency, larger capacity and is cheaper than DRAM). The firmware is
effectively going to enforce "cpu-less" nodes for any memory range that has
differentiated performance from the conventional memory pool, or
differentiated performance for a specific initiator.
Define the "N_CPU_MEM" nodemask for the "primary" nodes in order to
distinguish them from cpuless nodes (memory only, e.g. PMEM nodes) and
memoryless nodes (some architectures, e.g. Power, may have memoryless nodes).
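
To illustrate the three node classes, here is a minimal userspace sketch
that derives them from two bitmasks. It does not use the kernel's nodemask
API; the type nodemask_model_t, the struct node_states_model and the helper
names are all hypothetical, and the model is limited to 64 nodes. Note that
a cpuless node is only a candidate "migrate target": it also has to sit
below DRAM in the hierarchy, which this sketch does not model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model: one bit per node, up to 64 nodes. */
typedef uint64_t nodemask_model_t;

struct node_states_model {
	nodemask_model_t has_cpu;	/* nodes with at least one CPU */
	nodemask_model_t has_memory;	/* nodes with memory */
};

/* "Primary" nodes: both CPU and memory, as N_CPU_MEM is described. */
static nodemask_model_t cpu_mem_nodes(const struct node_states_model *s)
{
	return s->has_cpu & s->has_memory;
}

/* Cpuless nodes: memory only, e.g. PMEM nodes. */
static nodemask_model_t cpuless_nodes(const struct node_states_model *s)
{
	return s->has_memory & ~s->has_cpu;
}

/* Memoryless nodes: CPUs only. */
static nodemask_model_t memoryless_nodes(const struct node_states_model *s)
{
	return s->has_cpu & ~s->has_memory;
}

/* Test whether node @nid is set in @mask. */
static bool node_in_mask(nodemask_model_t mask, unsigned int nid)
{
	return nid < 64 && ((mask >> nid) & 1);
}
```

For example, with CPUs on nodes 0, 1 and 3 and memory on nodes 0, 1 and 2,
nodes 0 and 1 are "primary", node 2 is cpuless and node 3 is memoryless.
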
It is a little bit hard to find a suitable "migrate target" node, since this
requires the firmware to expose the physical characteristics of the memory
devices. I'm not quite sure what the best way is, or whether it is ready to
use now. Since PMEM is the only such device available for now, retrieving the
information from SRAT looks like the easiest way. We may figure out a better
way in the future.
Promotion/demotion happens only between "primary" nodes and "migrate target"
nodes: demotion goes from a "primary" node to a "migrate target" node, and
promotion goes from a "migrate target" node to a "primary" node. There is no
promotion/demotion between "migrate target" nodes, no promotion from "primary"
nodes to "migrate target" nodes, and no demotion from "migrate target" nodes
to "primary" nodes. This guarantees there are no cycles for memory demotion
or promotion.
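
The direction rule can be written down as a single predicate. This is a
sketch under the assumption that demotion always goes from a "primary" node
to a "migrate target" node and promotion always goes the other way, which
is how the rest of this design section uses the terms. The enum and
function names are hypothetical and do not appear in the patches.

```c
#include <stdbool.h>

/* Hypothetical node classes used by this model. */
enum node_kind_model { NODE_PRIMARY, NODE_MIGRATE_TARGET };

/* Hypothetical migration directions. */
enum move_kind_model { MOVE_DEMOTE, MOVE_PROMOTE };

/*
 * Demotion is only allowed primary -> migrate target, promotion only
 * migrate target -> primary. Every other combination is rejected, so a
 * page can never bounce between two nodes of the same class.
 */
static bool move_allowed(enum node_kind_model src,
			 enum node_kind_model dst,
			 enum move_kind_model move)
{
	if (move == MOVE_DEMOTE)
		return src == NODE_PRIMARY && dst == NODE_MIGRATE_TARGET;

	return src == NODE_MIGRATE_TARGET && dst == NODE_PRIMARY;
}
```

Because each direction has exactly one legal source class and one legal
destination class, a chain of demotions (or of promotions) has length one
and cannot form a cycle.
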
According to the discussion at LSF/MM 2019, "there should only be one node to
which pages could be migrated". So reclaim code just tries to demote the pages
to the closest "migrate target" node and only tries once. Otherwise "if all
nodes in the system were on a fallback list, a page would have to move through
every possible option - each RAM-based node and each persistent-memory node -
before actually being reclaimed. It would be necessary to maintain the history
of where each page has been, and would be likely to disrupt other workloads on
the system". The v2 patchset already worked this way, and v3 keeps the same
behavior.
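
A minimal sketch of "pick the single closest migrate target" follows. It is
a userspace model, not the find_next_best_node() change from the series:
the distance table, the is_target array, MODEL_MAX_NODES, MODEL_NO_NODE and
the function name are all assumptions made for illustration.

```c
#include <limits.h>

#define MODEL_MAX_NODES	4
#define MODEL_NO_NODE	(-1)

/*
 * Return the migrate target node closest to @src, or MODEL_NO_NODE if
 * there is none. The caller is expected to try this one node only and
 * not walk a fallback list.
 *
 * @distance:  hypothetical node distance table (smaller means closer)
 * @is_target: non-zero for nodes that are migrate targets
 */
static int closest_migrate_target(int src,
		const int distance[MODEL_MAX_NODES][MODEL_MAX_NODES],
		const int is_target[MODEL_MAX_NODES])
{
	int best = MODEL_NO_NODE;
	int best_dist = INT_MAX;

	for (int nid = 0; nid < MODEL_MAX_NODES; nid++) {
		if (nid == src || !is_target[nid])
			continue;
		if (distance[src][nid] < best_dist) {
			best_dist = distance[src][nid];
			best = nid;
		}
	}

	return best;
}
```

On ties the lowest-numbered node wins in this model; the patchset may break
ties differently.
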
The demotion code moves all the migration candidate pages onto one single
list, then migrates them together (including THP). This improves the
efficiency of migration according to Zi Yan's research. If the migration
fails, the unmigrated pages are put back on the LRU.
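
The batching idea can be modelled in two stages: collect every candidate
first, then migrate the whole batch in one pass and put back whatever did
not move. The sketch below is a userspace model only. struct page_model,
its migratable flag (a stand-in for whether migration would succeed) and
both function names are hypothetical; THP handling is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical location of a page in this model. */
enum page_loc_model { ON_LRU, ON_DEMOTE_LIST, ON_TARGET_NODE };

struct page_model {
	bool migratable;		/* stand-in: would migration succeed? */
	enum page_loc_model loc;
};

/* Stage 1: move every LRU page onto the single demotion list. */
static size_t isolate_candidates(struct page_model *pages, size_t n)
{
	size_t nr = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].loc == ON_LRU) {
			pages[i].loc = ON_DEMOTE_LIST;
			nr++;
		}
	}

	return nr;
}

/*
 * Stage 2: migrate the whole list in one pass. Pages that fail to
 * migrate are put back on the LRU. Returns the number that succeeded.
 */
static size_t demote_batch(struct page_model *pages, size_t n)
{
	size_t nr_succeeded = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].loc != ON_DEMOTE_LIST)
			continue;
		if (pages[i].migratable) {
			pages[i].loc = ON_TARGET_NODE;
			nr_succeeded++;
		} else {
			pages[i].loc = ON_LRU;
		}
	}

	return nr_succeeded;
}
```

Returning the success count mirrors the shortlog entry that makes
migrate_pages() return nr_succeeded, which the demotion counters need.
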
Use the most optimistic GFP flags to allocate pages on the "migrate target"
node.
To reduce the failure rate of demotion, check whether the "migrate target"
node is contended. If it is contended, just swap instead of migrating. If
migration fails with -ENOMEM, mark the node as contended. The contended flag
is cleared once the node gets balanced.
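
The contended check amounts to a one-bit state per "migrate target" node.
The following is a minimal userspace sketch of that logic, not the patch
itself: struct target_node_model and the three helpers are hypothetical
names, and the real flag storage and locking are not shown.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reclaim decisions in this model. */
enum reclaim_action_model { ACTION_DEMOTE, ACTION_SWAP };

/* Hypothetical per-node state: only the contended flag is modelled. */
struct target_node_model {
	bool contended;
};

/* Swap instead of migrating while the target node is contended. */
static enum reclaim_action_model
choose_action(const struct target_node_model *t)
{
	return t->contended ? ACTION_SWAP : ACTION_DEMOTE;
}

/* Record a migration result: only -ENOMEM marks the node contended. */
static void note_migrate_result(struct target_node_model *t, int err)
{
	if (err == -ENOMEM)
		t->contended = true;
}

/* Clear the flag once the node is balanced again. */
static void note_node_balanced(struct target_node_model *t)
{
	t->contended = false;
}
```

Other migration errors (for example -EAGAIN) leave the flag untouched in
this model, since the text only names -ENOMEM as the trigger.
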
For now "migrate" mode is not compatible with cpuset and mempolicy since it
is hard to get the process's task_struct from struct page. The cpuset and
process's mempolicy are stored in task_struct instead of mm_struct.
Only anonymous pages are demoted for the time being, since NUMA balancing
can't promote unmapped page cache. Page cache can be demoted easily, but how
to promote it is an open question; it may be done via mark_page_accessed().
Added vmstat counters for pgdemote_kswapd, pgdemote_direct and
numa_pages_promoted.
There are definitely still a lot of details that need to be sorted out. Any
comments are welcome.
Test
====
The stress test was done with mmtests plus application workloads (e.g.
sysbench, grep, etc).
Generate memory pressure by running mmtests' usemem-stress-numa-compact,
then run other applications as workload to stress the promotion and demotion
paths. The machine was still alive after the stress test had been running for
~30 hours. /proc/vmstat also shows:
...
pgdemote_kswapd 3316563
pgdemote_direct 1930721
...
numa_pages_promoted 81838
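
For anyone repeating the test, the three counters can be pulled out of
/proc/vmstat style text with a small helper. This is a userspace sketch and
not part of the series; the function name vmstat_lookup is hypothetical,
and it assumes the usual one "name value" pair per line format.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up counter @name in @text, which holds "name value" lines.
 * Only whole-name matches count, so "pgdemote" does not match
 * "pgdemote_kswapd". Returns true and stores the value on success.
 */
static bool vmstat_lookup(const char *text, const char *name,
			  unsigned long long *value)
{
	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, name, len) == 0 && line[len] == ' ')
			return sscanf(line + len, "%llu", value) == 1;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}

	return false;
}
```

Fed the numbers above, the two demotion counters add up to 5247284 demoted
pages against 81838 promoted pages.
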
[1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@xxxxxxxxx/
[2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@xxxxxxxxx/
[3]: https://lore.kernel.org/linux-mm/20190404071312.GD12864@xxxxxxxxxxxxxx/T/#me1c1ed102741ba945c57071de9749e16a76e9f3d
Yang Shi (9):
mm: define N_CPU_MEM node states
mm: Introduce migrate target nodemask
mm: page_alloc: make find_next_best_node find return migration target node
mm: migrate: make migrate_pages() return nr_succeeded
mm: vmscan: demote anon DRAM pages to migration target node
mm: vmscan: don't demote for memcg reclaim
mm: vmscan: check if the demote target node is contended or not
mm: vmscan: add page demotion counter
mm: numa: add page promotion counter
Documentation/sysctl/vm.txt | 6 +++
drivers/acpi/numa.c | 12 +++++
drivers/base/node.c | 4 ++
include/linux/gfp.h | 12 +++++
include/linux/migrate.h | 6 ++-
include/linux/mmzone.h | 3 ++
include/linux/nodemask.h | 4 +-
include/linux/vm_event_item.h | 3 ++
include/linux/vmstat.h | 1 +
include/trace/events/migrate.h | 3 +-
mm/compaction.c | 3 +-
mm/debug.c | 1 +
mm/gup.c | 4 +-
mm/huge_memory.c | 4 ++
mm/internal.h | 23 ++++++++
mm/memory-failure.c | 7 ++-
mm/memory.c | 4 ++
mm/memory_hotplug.c | 10 +++-
mm/mempolicy.c | 7 ++-
mm/migrate.c | 33 ++++++++----
mm/page_alloc.c | 20 +++++--
mm/vmscan.c | 186 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
mm/vmstat.c | 14 ++++-
23 files changed, 323 insertions(+), 47 deletions(-)