Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs

From: Aneesh Kumar K V
Date: Mon Jun 06 2022 - 12:39:41 EST


On 6/6/22 9:46 PM, Jonathan Cameron wrote:
On Mon, 6 Jun 2022 21:31:16 +0530
Aneesh Kumar K V <aneesh.kumar@xxxxxxxxxxxxx> wrote:

On 6/6/22 8:29 PM, Jonathan Cameron wrote:
On Fri, 3 Jun 2022 14:10:47 +0530
Aneesh Kumar K V <aneesh.kumar@xxxxxxxxxxxxx> wrote:
On 5/27/22 7:45 PM, Jonathan Cameron wrote:
On Fri, 27 May 2022 17:55:23 +0530
"Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxxxxx> wrote:
From: Jagdish Gediya <jvgediya@xxxxxxxxxxxxx>

Add support to read/write the memory tier index for a NUMA node.

/sys/devices/system/node/nodeN/memtier

where N = node id

When read, it lists the memory tier that the node belongs to.

When written, the kernel moves the node into the specified
memory tier; the tier assignment of all other nodes is not
affected.

If the memory tier does not exist, writing to the above file
creates the tier and assigns the NUMA node to that tier.

There was some discussion in v2 of Wei Xu's RFC that what matters
for creation is the rank, not the tier number.

My suggestion is to move to an explicit creation file such as
memtier/create_tier_from_rank, to which writing the rank results
in a new tier with the next device ID and the requested rank.

I think the below workflow is much simpler.

:/sys/devices/system# cat memtier/memtier1/nodelist
1-3
:/sys/devices/system# cat node/node1/memtier
1
:/sys/devices/system# ls memtier/memtier*
nodelist power rank subsystem uevent
/sys/devices/system# ls memtier/
default_rank max_tier memtier1 power uevent
:/sys/devices/system# echo 2 > node/node1/memtier
:/sys/devices/system#

:/sys/devices/system# ls memtier/
default_rank max_tier memtier1 memtier2 power uevent
:/sys/devices/system# cat memtier/memtier1/nodelist
2-3
:/sys/devices/system# cat memtier/memtier2/nodelist
1
:/sys/devices/system#

i.e., to create a tier we just write the tier id/tier index to the
node/nodeN/memtier file. That will create a new memory tier if needed
and add the node to that specific memory tier. Since for now we have a
1:1 mapping between tier index and rank value, we can derive the
rank value from the memory tier index.

For dynamic memory tier support, we can assign rank values such that
new memory tiers always come last in the demotion order.

I'm not keen on having to pass through an intermediate state where
the rank may well be wrong, but I guess it's not that harmful even
if it feels wrong ;)

Any new memory tier added can be of the lowest rank (rank 0) and hence
will appear last in the demotion order.
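
A minimal sketch of what that could look like, reusing the layout of the
transcript above (memtier3 and its rank value here are illustrative
assumptions, not output from the patch):

:/sys/devices/system# echo 3 > node/node3/memtier   # memtier3 does not exist yet; kernel creates it
:/sys/devices/system# cat memtier/memtier3/rank     # new tier defaults to the lowest rank
0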

Depends on driver interaction - if new memory is CXL attached or
GPU attached, chances are the driver has an input on which tier
it is put in by default.

The user can then
assign the right rank value to the memory tier? Also, the actual demotion
target paths are built during memory block onlining, which in most cases
would happen after we properly verify that the device got assigned to
the right memory tier with the correct rank value?

Agreed, though that may change the model of how memory is brought online
somewhat.


Races are potentially a bit of a pain though depending on what we
expect the usage model to be.

There are patterns (CXL regions for example) of guaranteeing the
'right' device is created by doing something like

cat create_tier > temp.txt
# (temp gets 2, for example, on the first call; the
# next read of this file gets 3, etc.)

cat temp.txt > create_tier
# will fail if there hasn't been a read of the same value

Assuming all software keeps to the model, then there are no
race conditions over creation. Otherwise we have two new
devices turn up very close to each other and userspace scripting
tries to create two new tiers - if it races they may end up in
the same tier when that wasn't the intent. Then code to set
the rank also races and we get two potentially very different
memories in a tier with a randomly selected rank.

Fun and games... And a fine illustration of why sysfs-based 'device'
creation is tricky to get right (and lots of cases in the kernel
don't get it right).

I would expect userspace to be careful and verify the memory tier and
rank value before we online the memory blocks backed by the device. Even
if we race, the result would be two devices not intended to be part of
the same memory tier appearing in the same tier. But at that point we won't
have built demotion targets yet. So userspace could verify this and move the
nodes out of the memory tier. Once it is verified, memory blocks can be
onlined.
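
A sketch of that verify-before-online flow, using the sysfs paths from the
transcript above (the memory block number is illustrative; the rank value
assumes the fixed mapping discussed below):

:/sys/devices/system# cat node/node1/memtier        # confirm the node landed in the intended tier
2
:/sys/devices/system# cat memtier/memtier2/rank     # confirm the tier's rank
100
:/sys/devices/system# echo online > memory/memory42/state   # only then online the memory blocks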

The race is there and not avoidable as far as I can see. Two processes A and B.

A checks for a spare tier number
B checks for a spare tier number
A tries to assign node 3 to new tier 2 (new tier created)
B tries to assign node 4 to new tier 2 (accidentally hits the existing tier - as this
is the same method we'd use to put it in an existing tier, we can't tell this
write was meant to create a new tier).
A writes rank 100 to tier 2
A checks rank for tier 2 and finds it is 100 as expected.
B writes rank 200 to tier 2 (it could check if the rank is still the default, but even that is racy).
B checks the rank for tier 2 and finds it is 200 as expected.
A onlines memory.
B onlines memory.

Both think they got what they wanted, but A definitely didn't.

One workaround is the read/write approach with create_tier.

A reads create_tier - gets 2.
B reads create_tier - gets 3.
A writes 2 to create_tier as that's what it read.
B writes 3 to create_tier as that's what it read.

Both continue with their created tiers. Obviously this can exhaust tiers, but if
the interface is root only, one could just create lots anyway, so we are no worse off.
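
A sketch of that handshake, with prompts prefixed A and B to mark the two
shells (create_tier is the proposed interface described above, not an
existing file):

A:/sys/devices/system# cat memtier/create_tier       # A reads the next free tier id
2
B:/sys/devices/system# cat memtier/create_tier       # B's read returns the next value
3
A:/sys/devices/system# echo 2 > memtier/create_tier  # succeeds: matches what A read
B:/sys/devices/system# echo 2 > memtier/create_tier  # would fail: 2 is not what B read
B:/sys/devices/system# echo 3 > memtier/create_tier  # succeeds: memtier3 created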

Having said that, can you outline the usage of
memtier/create_tier_from_rank?

There are corner cases to deal with...

A writes 100 to create_tier_from_rank.
A goes looking for matching tier - finds it: tier2
B writes 200 to create_tier_from_rank
B goes looking for matching tier - finds it: tier3

The rest is fine, as they are operating on different tiers.

Trickier is
A writes 100 to create_tier_from_rank - succeeds.
B writes 100 to create_tier_from_rank - could fail, or could just eat it?

Logically this is the same as a separate create_tier followed by a write
of the rank, but in one operation; you then need to search
for the right tier. As such, perhaps a create_tier
that does the read/write pair as above is the best solution.


This is all good when we allow dynamic rank values. But currently we are restricting ourselves to three rank values, as below:

rank   memtier
300    memtier0
200    memtier1
100    memtier2
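
Under that fixed 1:1 mapping, the rank is directly derivable from the tier
index; in shell terms (illustrative only, the kernel would hard-code this):

tier_index=1
rank=$((300 - tier_index * 100))   # memtier0 -> 300, memtier1 -> 200, memtier2 -> 100
echo $rank                         # prints 200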

Now with the above, how do we define a write to create_tier_from_rank?
What should the behavior be if the user writes a value other than the rank
values defined above? Also, enforcing those three rank values as the only
supported ones implies teaching userspace about them. I am trying to see
how to fit create_tier_from_rank in without requiring the above.

Can we look at implementing create_tier_from_rank when we start supporting dynamic tiers/rank values? i.e.,

we still allow node/nodeN/memtier, but with dynamic tiers a race-free
way to get a new memory tier would be 'echo rank > memtier/create_tier_from_rank'. We could also say memtier0/1/2 are kernel-defined memory tiers, and writing to memtier/create_tier_from_rank will create new memory tiers above memtier2 with the rank value specified?
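
A sketch of how that proposed flow might look (create_tier_from_rank does
not exist yet; the directory listing and new tier id below are assumptions):

:/sys/devices/system# ls memtier/
default_rank max_tier memtier0 memtier1 memtier2 power uevent
:/sys/devices/system# echo 50 > memtier/create_tier_from_rank   # new tier with a user-specified rank
:/sys/devices/system# ls memtier/
default_rank max_tier memtier0 memtier1 memtier2 memtier3 power uevent
:/sys/devices/system# cat memtier/memtier3/rank
50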

-aneesh