On Mon 15-04-19 17:09:07, Yang Shi wrote:
Why cannot we simply demote in the proximity order? Why do you make
On 4/12/19 1:47 AM, Michal Hocko wrote:
On Thu 11-04-19 11:56:50, Yang Shi wrote:
In this patchset N_CPU_MEM is used to tell us what nodes are cpuless nodes.
Design
I still believe you are overcomplicating this without a strong reason.
Basically, the approach aims to spread data from DRAM (closest to the local
CPU) down to PMEM and disk (the lower tier storage is typically assumed to
be slower, larger and cheaper than the upper tier) according to its hotness.
The patchset tries to achieve this goal by doing memory promotion/demotion via
NUMA balancing and memory reclaim, as the diagram below shows:
DRAM <--> PMEM <--> Disk
When DRAM is under memory pressure, pages are demoted to PMEM via the page
reclaim path. NUMA balancing then promotes a page back to DRAM once it is
referenced again. Memory pressure on a PMEM node pushes the inactive pages of
that node to disk via swap.
The promotion/demotion happens only between "primary" nodes (the nodes that have
both CPU and memory) and PMEM nodes. There is no promotion/demotion between PMEM
nodes, no promotion from DRAM to PMEM, and no demotion from PMEM to DRAM.
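The rule above can be sketched in user space as a small classification of nodes and of the moves allowed between them. This is only an illustration of the stated policy, not kernel code: the `Node` type and the helper names (`is_primary`, `is_cpuless`, `migration_kind`) are hypothetical and do not come from the patchset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical model of a NUMA node; not a kernel structure."""
    nid: int
    has_cpu: bool
    has_memory: bool


def is_primary(node):
    # "Primary" node: has both CPU and memory (e.g. a DRAM node).
    return node.has_cpu and node.has_memory


def is_cpuless(node):
    # Memory-only node, assumed here to be the second tier (e.g. PMEM).
    return node.has_memory and not node.has_cpu


def migration_kind(src, dst):
    """Return 'demote', 'promote', or None when the move is not allowed."""
    if is_primary(src) and is_cpuless(dst):
        return "demote"
    if is_cpuless(src) and is_primary(dst):
        return "promote"
    # Primary <-> primary and cpuless <-> cpuless moves are not part of
    # the promotion/demotion scheme described above.
    return None


dram0 = Node(0, has_cpu=True, has_memory=True)
dram1 = Node(1, has_cpu=True, has_memory=True)
pmem2 = Node(2, has_cpu=False, has_memory=True)
pmem3 = Node(3, has_cpu=False, has_memory=True)
```

In this model a move is classified purely by the kind of its source and destination node, which is the same distinction the "primary" vs. cpuless split is meant to capture.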
The HMAT is effectively going to enforce "cpu-less" nodes for any memory range
that has differentiated performance from the conventional memory pool, or
differentiated performance for a specific initiator, per Dan Williams. So,
assuming PMEM nodes are cpuless nodes sounds reasonable.
However, cpuless nodes might not be PMEM nodes. But, actually, memory
promotion/demotion doesn't care what kind of memory the target nodes are.
It could be DRAM, PMEM or something else, as long as it is second tier
memory (slower, larger and cheaper than regular DRAM); otherwise it sounds
pointless to do such demotion.
The "N_CPU_MEM" nodemask is defined for the nodes which have both CPU and
memory, in order to distinguish them from cpuless nodes (memory only, e.g.
PMEM nodes) and memoryless nodes (some architectures, e.g. Power, may have
memoryless nodes). Typically, memory allocation happens on such nodes by
default unless cpuless nodes are specified explicitly; cpuless nodes are just
fallback nodes. The nodes with both CPU and memory are therefore also known as
"primary" nodes in this patchset. With a two-tier memory system (e.g. DRAM +
PMEM), this sounds good enough to demonstrate the promotion/demotion approach
for now, and it looks more architecture-independent. But it may be better to
construct such a node mask by reading hardware information (e.g. HMAT),
particularly for more complex
Why cannot we start simple and build from there? In other words I do not
think we really need anything like N_CPU_MEM at all.
They would be the preferred demotion target. Of course, we could rely on
firmware to just demote to the next best node, but it may be a "preferred"
node, if so I don't see too much benefit achieved by demotion. Am I missing
cpuless nodes so special? If other close nodes are vacant then just use
You definitely have to follow policy. You cannot demote to a node which
I would expect that the very first attempt wouldn't do much more than
Do you mean respect mempolicy or cpuset when doing demotion? I was wondering
migrate to-be-reclaimed pages (without an explicit binding) with a
this, but I didn't do so in the current implementation since it may need to
walk the rmap to retrieve the mempolicy in the reclaim path. Is there any
easier way to do so?
is outside of the cpuset/mempolicy because you are breaking the contract
expected by userspace. That implies doing a rmap walk.
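The constraint can be illustrated with a short user-space sketch: walk the candidate nodes in proximity order and take the first one that the page's policy allows. The function name `pick_demotion_target` and the plain sets of node ids are assumptions made for illustration; in the kernel the allowed set would have to come from the cpuset/mempolicy found via the rmap walk mentioned above.

```python
def pick_demotion_target(candidates, allowed_nodes):
    """Return the first candidate node id permitted by policy, else None.

    candidates    -- node ids ordered by proximity, closest first (assumed)
    allowed_nodes -- set of node ids permitted by cpuset/mempolicy (assumed)
    """
    for nid in candidates:
        if nid in allowed_nodes:
            return nid
    # No permitted target: the caller would fall back to regular reclaim
    # instead of migrating the page outside its policy.
    return None
```

For example, with candidates `[2, 3]` and an allowed set `{0, 3}` the sketch picks node 3, and with an allowed set `{0, 1}` it returns `None` rather than breaking the policy.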
I presume this is a result of a synthetic workload, right? Or do you
I would also not touch the numa balancing logic at this stage and rather
I agree we would prefer to start from something simpler and see how it works.
see how the current implementation behaves.
The "twice access" optimization is aimed at reducing the PMEM bandwidth burden,
since the bandwidth of PMEM is a scarce resource. I did compare "twice access"
to "no twice access"; it does save a lot of bandwidth for some once-off access
patterns. For example, when running a stress test with mmtests'
usemem-stress-numa-compact, the kernel would promote ~600,000 pages with
"twice access" in 4 hours, but it would promote ~80,000,000 pages without
have any numbers for a real life usecase?
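To make the "twice access" idea discussed above concrete, here is a minimal user-space sketch. It assumes the filter means "promote a page only on its second observed access"; the thread does not spell out the exact kernel mechanism, so this is an interpretation, and the name `count_promotions` is hypothetical. The access traces are made up for illustration and are not measurements.

```python
def count_promotions(accesses, twice_access):
    """Count promotions for a trace of accesses to pages that start in PMEM.

    accesses     -- iterable of page ids, in access order
    twice_access -- if True, a page is promoted only on its second access
    """
    seen_once = set()   # pages accessed once while still in PMEM
    in_dram = set()     # pages already promoted
    promoted = 0
    for page in accesses:
        if page in in_dram:
            continue                # already promoted, nothing to do
        if twice_access and page not in seen_once:
            seen_once.add(page)     # first access: remember, do not promote
            continue
        in_dram.add(page)
        promoted += 1
    return promoted


# A once-off pattern: every page is touched exactly once.
once_off = list(range(1000))
# A pattern with one hot page (1) and one page touched once (2).
hot = [1, 1, 2, 1]
```

Under this assumption the once-off trace triggers no promotions with the filter and one per page without it, which is the kind of PMEM bandwidth saving the optimization is after.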