Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node

From: Brice Goglin
Date: Mon Mar 25 2019 - 12:15:18 EST

On 23/03/2019 at 05:44, Yang Shi wrote:
> With Dave Hansen's patches merged into Linus's tree
> PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node
> effectively and efficiently is still a question.
> There have been a couple of proposals posted on the mailing list [1] [2].
> The patchset is aimed to try a different approach from this proposal [1]
> to use PMEM as NUMA nodes.
> The approach is designed to follow the below principles:
> 1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.
> 2. DRAM first/by default. No surprise to existing applications and default
> running. PMEM will not be allocated unless its node is specified explicitly
> by NUMA policy. Some applications may be not very sensitive to memory latency,
> so they could be placed on PMEM nodes then have hot pages promote to DRAM
> gradually.

I am not against the approach for some workloads. However, many HPC
people would rather do this manually. But there's currently no easy way
to find out from userspace whether a given NUMA node is DDR or PMEM*. We
have to assume HMAT is available (and correct) and look at performance
attributes. When talking to humans, it would be better to say "I
allocated on the local DDR NUMA node" rather than "I allocated on the
fastest node according to HMAT latency".

Also, once we have HBM+DDR, some applications may want to use DDR by
default, which means they want the *slowest* node according to HMAT (by
the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?).
Performance attributes could help, but how does user-space know for sure
that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?

It seems to me that exporting a flag in sysfs saying whether a node is
PMEM could be convenient. Patch series [1] exported a "type" in sysfs
node directories ("pmem" or "dram"). I don't know if there's an easy
way to define what HBM is and expose that type too.


* As far as I know, the only way is to look at all DAX devices until you
find the given NUMA node in the "target_node" attribute. If none, you're
likely not PMEM-backed.
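
The lookup described in this footnote could be sketched roughly as follows.
This is only an illustration, assuming DAX devices show up under
/sys/bus/dax/devices and that each one exposes a "target_node" attribute;
the function names are hypothetical, and a node missing from the result is
only "likely" not PMEM-backed.

```python
import glob
import os


def pmem_backed_nodes(dax_root="/sys/bus/dax/devices"):
    """Collect the NUMA node ids named by any DAX device's target_node.

    dax_root is an assumed sysfs location; pass another path to test
    against a fake directory tree.
    """
    nodes = set()
    for path in glob.glob(os.path.join(dax_root, "*", "target_node")):
        try:
            with open(path) as f:
                node = int(f.read().strip())
        except (OSError, ValueError):
            continue  # attribute unreadable or malformed: skip this device
        if node >= 0:  # a negative value means no target node is set
            nodes.add(node)
    return nodes


def is_likely_pmem(node, dax_root="/sys/bus/dax/devices"):
    """True if some DAX device lists `node` as its target_node."""
    return node in pmem_backed_nodes(dax_root)
```

Calling is_likely_pmem(2) would then scan every DAX device once, which is
exactly the indirection a per-node "type" attribute in sysfs would avoid.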

> [1]: