Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

From: Yang Shi
Date: Thu Apr 18 2019 - 15:23:57 EST

Next message: tip-bot for Dave Hansen: "[tip:x86/urgent] x86/mpx: Fix recursive munmap() corruption"
Previous message: Thomas Gleixner: "Re: [5.0.0 rc3 BUG] possible irq lock inversion dependency detected"
In reply to: Keith Busch: "Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node"
Next in thread: Zi Yan: "Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 4/18/19 11:16 AM, Keith Busch wrote:

On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote:

On 4/17/19 2:23 AM, Michal Hocko wrote:

Yes. This could be achieved by a GFP_NOWAIT opportunistic allocation for
the migration target. That should prevent loops or artificial node
exhaustion quite naturally, AFAICS. Maybe we will need some tricks to
raise the watermark, but I am not convinced something like that is really
necessary.

I don't think GFP_NOWAIT alone is good enough.

Let's say we have a system full of clean page cache and only two nodes:
0 and 1. GFP_NOWAIT will eventually kick off kswapd on both nodes.
Each kswapd will be migrating pages to the *other* node since each is in
the other's fallback path.

I think what you're saying is that, eventually, the kswapds will see
allocation failures and stop migrating, providing hysteresis. This is
probably true.

But, I'm more concerned about that window where the kswapds are throwing
pages at each other because they're effectively just wasting resources
in this window. I guess we should figure out how large this window is
and how fast (or if) the dampening occurs in practice.

I'm still refining tests to help answer this and have some preliminary
data. My test rig has CPU + memory Node 0, memory-only Node 1, and a
fast swap device. The test has an application strictly mbind more than
the total memory to node 0, then write random cachelines forever from
per-CPU threads.

Thanks for the test. A follow-up question: how large is each node? Is node 1 bigger than node 0? PMEM typically has larger capacity, so I'm wondering whether the capacity makes a difference.

I'm testing two memory pressure policies:

Node 0 can migrate to Node 1, no cycles
Node 0 and Node 1 migrate with each other (0 -> 1 -> 0 cycles)

After the initial ramp up time, the second policy is ~7-10% slower than
no cycles. There doesn't appear to be a temporary window dealing with
bouncing pages: it's just a slower overall steady state. Looks like when
migration fails and falls back to swap, the newly freed pages occasionally
get sniped by the other node, keeping the pressure up.
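
To make the difference between the two policies concrete, here is a minimal sketch of checking a migration-target map for cycles. It is only an illustration, not code from the patch set: the dictionary representation and the name find_demotion_cycle are hypothetical, and it assumes each node has at most one migration target.

```python
def find_demotion_cycle(targets):
    """Return one cycle in the migration-target map as a list of node ids,
    or None if the map has no cycle.

    `targets` is a hypothetical representation: it maps a node id to the
    single node its pages migrate to under memory pressure. A node that
    is absent from the map has no migration target.
    """
    for start in targets:
        path = []      # nodes visited on this walk, in order
        seen_at = {}   # node id -> its index in `path`
        node = start
        while node is not None and node not in seen_at:
            seen_at[node] = len(path)
            path.append(node)
            node = targets.get(node)
        if node is not None:
            # The walk returned to a node already on the path: a cycle.
            return path[seen_at[node]:] + [node]
    return None


# The two policies from the test, under the assumptions above.
no_cycles = {0: 1}           # Node 0 can migrate to Node 1
with_cycles = {0: 1, 1: 0}   # Node 0 and Node 1 migrate with each other

print(find_demotion_cycle(no_cycles))
print(find_demotion_cycle(with_cycles))
```

With the first map the walk from node 0 ends at node 1, which has no target, so no cycle is reported. With the second map the walk returns to node 0, which corresponds to the 0 -> 1 -> 0 case.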
