Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node

From: Yang Shi
Date: Mon Mar 25 2019 - 16:05:11 EST




On 3/25/19 9:15 AM, Brice Goglin wrote:
On 23/03/2019 at 05:44, Yang Shi wrote:
With Dave Hansen's patches merged into Linus's tree

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4

PMEM can now be hot-plugged as a NUMA node. However, how to use PMEM as a NUMA node
effectively and efficiently is still an open question.

There have been a couple of proposals posted on the mailing list [1] [2].

This patchset tries a different approach from proposal [1]
for using PMEM as NUMA nodes.

The approach is designed to follow the below principles:

1. Use PMEM as normal NUMA node, no special gfp flag, zone, zonelist, etc.

2. DRAM first/by default. No surprise to existing applications and default
running. PMEM will not be allocated unless its node is specified explicitly
by NUMA policy. Some applications may not be very sensitive to memory latency,
so they could be placed on PMEM nodes and then have their hot pages promoted to DRAM
gradually.

I am not against the approach for some workloads. However, many HPC
people would rather do this manually. But there's currently no easy way
to find out from userspace whether a given NUMA node is DDR or PMEM*. We
have to assume HMAT is available (and correct) and look at performance
attributes. When talking to humans, it would be better to say "I
allocated on the local DDR NUMA node" rather than "I allocated on the
fastest node according to HMAT latency".

Yes, I agree that some information should be exposed to the kernel or userspace to tell which nodes are DRAM nodes and which are not (maybe HBM or PMEM). I assume the default allocation should end up on DRAM nodes for most workloads. If someone would like to control this manually by means other than mempolicy, the default allocation node mask could be exported to user space via sysfs so that it can be changed on demand.


Also, when we'll have HBM+DDR, some applications may want to use DDR by
default, which means they want the *slowest* node according to HMAT (by
the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?).
Performance attributes could help, but how does user-space know for sure
that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?

This is what I mentioned above: we need information exported from HMAT or something similar to tell us which nodes are DRAM nodes, since DRAM may be the lowest-tier memory.

Or we may be able to assume that the nodes associated with CPUs are DRAM nodes, on the assumption that both HBM and PMEM are CPU-less nodes.
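
For illustration only, here is a minimal userspace sketch of that heuristic. It is not part of the patchset. It assumes the standard /sys/devices/system/node/nodeN/cpulist files are present, and the helper name split_nodes_by_cpus is made up for this example. Nodes with an empty cpulist are reported as CPU-less; whether those are really HBM or PMEM is exactly the assumption under discussion.

```python
import os
import re


def split_nodes_by_cpus(node_root="/sys/devices/system/node"):
    """Return (nodes_with_cpus, cpuless_nodes) as two sorted lists of node ids.

    Hypothetical helper: a node counts as CPU-less when its cpulist
    file is empty. Treating CPU-less nodes as non-DRAM is an assumption.
    """
    with_cpus, cpuless = [], []
    for name in os.listdir(node_root):
        match = re.fullmatch(r"node(\d+)", name)
        if not match:
            continue  # skip entries such as has_cpu, online, possible
        try:
            with open(os.path.join(node_root, name, "cpulist")) as f:
                cpulist = f.read().strip()
        except OSError:
            continue  # attribute missing or unreadable: ignore this node
        (with_cpus if cpulist else cpuless).append(int(match.group(1)))
    return sorted(with_cpus), sorted(cpuless)
```

Calling split_nodes_by_cpus() with no argument reads the live sysfs tree; passing another directory makes it easy to try against a copied or mocked layout.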

Thanks,
Yang


It seems to me that exporting a flag in sysfs saying whether a node is
PMEM could be convenient. Patch series [1] exported a "type" in sysfs
node directories ("pmem" or "dram"). I don't know if there's an easy
way to define what HBM is and expose that type too.

Brice

* As far as I know, the only way is to look at all DAX devices until you
find the given NUMA node in the "target_node" attribute. If none, you're
likely not PMEM-backed.
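
The lookup described in this footnote can be sketched roughly as follows. This is only an illustration under stated assumptions: DAX devices are taken to appear under /sys/bus/dax/devices with a "target_node" attribute, negative values are treated as "no node assigned", and the function names are invented for the example.

```python
import glob
import os


def pmem_target_nodes(dax_root="/sys/bus/dax/devices"):
    """Return the set of NUMA node ids named by any DAX device's target_node."""
    nodes = set()
    for path in glob.glob(os.path.join(dax_root, "*", "target_node")):
        try:
            with open(path) as f:
                node = int(f.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric attribute: skip this device
        if node >= 0:  # assumption: a negative value means no target node
            nodes.add(node)
    return nodes


def is_probably_pmem(node, dax_root="/sys/bus/dax/devices"):
    """True if some DAX device lists this node as its target_node."""
    return node in pmem_target_nodes(dax_root)
```

A False result only means no DAX device points at the node, so, as said above, the node is likely (not certainly) not PMEM-backed.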


[1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@xxxxxxxxx/