On Sat 23-03-19 12:44:25, Yang Shi wrote:
> With Dave Hansen's patches merged into Linus's tree
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> PMEM can be hot-plugged as a NUMA node now. But how to use PMEM as a NUMA
> node effectively and efficiently is still an open question.
>
> There have been a couple of proposals posted on the mailing list [1] [2].
>
> This patchset tries a different approach from proposal [1] to use PMEM as
> NUMA nodes.
>
> The approach is designed to follow the below principles:
>
> 1. Use PMEM as a normal NUMA node: no special gfp flag, zone, zonelist, etc.
>
> 2. DRAM first/by default. No surprise to existing applications and default
> running. PMEM will not be allocated unless its node is specified explicitly
> by NUMA policy. Some applications may not be very sensitive to memory
> latency, so they could be placed on PMEM nodes and have hot pages promoted
> to DRAM gradually.

Why are you pushing yourself into the corner right at the beginning? If
the PMEM is exported as a regular NUMA node then the only difference
should be performance characteristics (modulo durability, which shouldn't
play any role in this particular case, right?). Applications which are
sensitive to memory access latency should be using proper binding already.
Some NUMA topologies might already have quite large interconnect penalties.
So this doesn't sound like an argument to me, TBH.
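
For completeness, "proper binding" here just means the existing mempolicy
API. A minimal userspace sketch, assuming the PMEM shows up as node 2 (that
node id is purely an example and platform dependent):

/*
 * Minimal sketch: bind an anonymous mapping to an assumed PMEM node (node 2)
 * with the existing MPOL_BIND policy. Build with -lnuma for the mbind()
 * wrapper from <numaif.h>.
 */
#include <numaif.h>		/* mbind(), MPOL_BIND */
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	size_t len = 64UL << 20;			/* 64MB region */
	unsigned long nodemask = 1UL << 2;		/* assumed PMEM node id 2 */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Every page faulted in this range has to come from node 2. */
	if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	/* ... touch the memory and run the workload ... */
	munmap(buf, len);
	return 0;
}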

> 5. Control memory allocation and hot/cold page promotion/demotion on a
> per-VMA basis.

What does that mean? Anon vs. file backed memory?

[...]

> 2. Introduce a new mempolicy, called MPOL_HYBRID, to keep other mempolicy
> semantics intact. We would like to have memory placement control on per
> process or even per VMA granularity. So, mempolicy sounds more reasonable
> than madvise. The new mempolicy is mainly used for launching processes on
> PMEM nodes and then migrating hot pages to DRAM nodes via NUMA balancing.
> MPOL_BIND could bind to PMEM nodes too, but migrating to DRAM nodes would
> just break its semantics. MPOL_PREFERRED can't constrain the allocation to
> PMEM nodes. So it sounds like a new mempolicy is needed to fulfill the
> usecase.

The above restriction pushes you to invent an API which is not really
trivial to get right and it seems quite artificial to me already.
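
Just so we are talking about the same thing, I assume the intended usage
would look roughly like the sketch below. MPOL_HYBRID is not a merged ABI,
so both the name and the placeholder value are taken from the proposal, not
from any existing kernel, and the PMEM node id is again an assumption:

/*
 * Hypothetical usage of the proposed MPOL_HYBRID mode: constrain allocations
 * to the PMEM node(s) initially and let NUMA balancing promote hot pages to
 * DRAM later. Neither the MPOL_HYBRID name nor the value below is part of
 * any merged kernel ABI; current kernels will return EINVAL here.
 */
#include <numaif.h>		/* set_mempolicy() */
#include <stdio.h>

#ifndef MPOL_HYBRID
#define MPOL_HYBRID	6	/* placeholder value for the proposed mode */
#endif

int main(void)
{
	unsigned long pmem_nodes = 1UL << 2;	/* assumed PMEM node id 2 */

	/*
	 * Per the cover letter: allocations are constrained to the PMEM nodes
	 * (unlike MPOL_PREFERRED), but NUMA balancing may still migrate hot
	 * pages away to DRAM (unlike MPOL_BIND).
	 */
	if (set_mempolicy(MPOL_HYBRID, &pmem_nodes, sizeof(pmem_nodes) * 8)) {
		perror("set_mempolicy");
		return 1;
	}

	/* ... run (or exec) the latency-insensitive workload ... */
	return 0;
}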

> 3. The new mempolicy would promote pages to DRAM via NUMA balancing. IMHO, I
> don't think the kernel is a good place to implement a sophisticated hot/cold
> page detection algorithm due to the complexity and overhead. But the kernel
> should have such a capability. NUMA balancing sounds like a good starting
> point.

This is what the kernel does all the time. We call it memory reclaim.

> 4. Promote twice-faulted pages. Use PG_promote to track whether a page has
> been faulted twice. This is an optimization to NUMA balancing to reduce
> migration thrashing and the overhead of migrating from PMEM.

I am sorry, but page flags are an extremely scarce resource and a new
flag is extremely hard to get. On the other hand we already do have
use-twice detection for mapped page cache (see page_check_references). I
believe we can generalize that to anon pages as well.
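
To make the point concrete, a very rough paraphrase of that use-twice logic,
not the actual mm/vmscan.c code, looks like this (the page_used_twice()
helper name is made up for illustration):

/*
 * Rough paraphrase of the use-twice heuristic in page_check_references()
 * (mm/vmscan.c), not the actual kernel code. referenced_ptes would come from
 * page_referenced().
 */
static bool page_used_twice(struct page *page, int referenced_ptes)
{
	/* Read and clear the PG_referenced bit left by the previous scan. */
	int referenced_before = TestClearPageReferenced(page);

	if (!referenced_ptes)
		return false;		/* not touched since the last look */

	if (referenced_before || referenced_ptes > 1)
		return true;		/* seen (at least) twice: treat as hot */

	SetPageReferenced(page);	/* remember the first access */
	return false;
}

In principle the same, already existing, bit could track the first fault on
a PMEM page rather than a brand new PG_promote flag.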

> 5. When DRAM has memory pressure, demote pages to PMEM via the page reclaim
> path. This is quite similar to other proposals. NUMA balancing will then
> promote the page back to DRAM once it is referenced again. But the
> promotion/demotion still assumes two-tier main memory. And the demotion may
> break mempolicy.

Yes, this sounds like a good idea to me ;)
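
Just to illustrate where I would expect this to hook in, something along
these lines; next_demotion_node() and migrate_page_to_node() are made-up
helpers for the sketch, not existing functions, and the real
shrink_page_list() integration would obviously be more involved:

/*
 * Sketch only: a possible demotion step in the reclaim path. Both
 * next_demotion_node() and migrate_page_to_node() are hypothetical helpers
 * used for illustration.
 */
static bool try_demote_page(struct page *page)
{
	int nid = page_to_nid(page);
	int target = next_demotion_node(nid);	/* e.g. the nearest PMEM node */

	if (target == NUMA_NO_NODE)
		return false;		/* already bottom tier: reclaim as usual */

	/* Migrate to the lower tier instead of swapping the page out. */
	return migrate_page_to_node(page, target) == 0;
}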

> 6. Anonymous pages only for the time being, since NUMA balancing can't
> promote unmapped page cache.

As long as nvdimm access is faster than the regular storage then using any
node (including the pmem one) should be OK.