Re: [RFC PATCH 1/2] mm/swap, memcg: basic structure and logic for per cgroup swap priority control

From: YoungJun Park
Date: Mon Jul 07 2025 - 11:00:38 EST


On Mon, Jul 07, 2025 at 11:59:49AM +0200, Michal Koutný wrote:
> Hello.
>
> On Tue, Jul 01, 2025 at 10:08:46PM +0900, YoungJun Park <youngjun.park@xxxxxxx> wrote:
> > memory.swap.priority
> ...
>
> > To assign priorities to swap devices in the current cgroup,
> > write one or more lines in the following format:
> >
> > <swap_device_unique_id> <priority>
>
> How would the user know this unique_id? (I don't see it in /proc/swaps.)

The unique_id is a new identifier I introduced to refer to active
swap devices. One is allocated each time a swap device is turned on.
I did explore other keys, such as the swap device path, but concluded
that a separate unique_id is more suitable for this context.
Initially, I proposed printing it directly in memory.swap.priority to
allow usage like:

$ swapon
NAME TYPE SIZE USED PRIO
/dev/sdb partition 300M 0B 10
/dev/sdc partition 300M 0B 5

$ cat memory.swap.priority
Active
/dev/sdb unique:1 prio:10
/dev/sdc unique:2 prio:5

Following your suggestion, I've dropped this initial proposal and
considered four alternatives. I'm currently leaning towards
options 2 and 4, and I plan to propose option 4 as the primary
approach:

1. /proc/swaps with ID: We've rejected this due to potential ABI
changes.

2. New /proc interface: This could be /proc/swaps with the ID,
or a dedicated swapdevice file with the ID. While viable, I prefer
not to add new /proc interfaces if we can avoid it.

3. /sys/kernel/mm/swap/ location: (Similar to vma_ra_enabled)
This was rejected because sysfs typically shows configured values,
not dynamic identifiers, which would be inconsistent with existing
conventions.

4. Align memory.swap.priority.effective with /proc/swaps:
Listing the <id> <prio> pairs in memory.swap.priority.effective in
the same order as the entries in /proc/swaps would let users infer
which swap device corresponds to which ID by position. For example:

$ swapon
NAME TYPE SIZE USED PRIO
/dev/sdb partition 300M 0B 10
/dev/sdc partition 300M 0B 5

$ cat memory.swap.priority.effective
Active
1 10 // this is /dev/sdb
2 5 // this is /dev/sdc
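
To make the order-based matching in option 4 concrete, here is a
minimal user-space sketch in Python. It assumes the layout shown above
for the proposed memory.swap.priority.effective file (a status line
followed by "<id> <prio>" rows) and the usual /proc/swaps layout (a
header line followed by one row per device). The function name is
hypothetical and nothing here is part of the patch.

```python
def map_ids_to_devices(proc_swaps_text, effective_text):
    """Pair each swap device with an ID by matching line order."""
    # /proc/swaps: first line is a column header, then one row per device.
    swap_rows = [ln.split() for ln in proc_swaps_text.splitlines()[1:] if ln.strip()]
    devices = [row[0] for row in swap_rows]

    # Assumed layout of the proposed file: a status line ("Active"),
    # then "<id> <prio>" rows in the same order as /proc/swaps.
    eff_rows = [ln.split() for ln in effective_text.splitlines()[1:] if ln.strip()]
    if len(eff_rows) != len(devices):
        raise ValueError("row counts differ; order-based matching is unsafe")

    # Only the first two fields of each row are used.
    return {int(row[0]): (dev, int(row[1])) for dev, row in zip(devices, eff_rows)}


if __name__ == "__main__":
    proc_swaps = (
        "Filename Type Size Used Priority\n"
        "/dev/sdb partition 307196 0 10\n"
        "/dev/sdc partition 307196 0 5\n"
    )
    effective = "Active\n1 10\n2 5\n"
    print(map_ids_to_devices(proc_swaps, effective))
    # {1: ('/dev/sdb', 10), 2: ('/dev/sdc', 5)}
```

One caveat with this approach: the two files are read separately, so a
swapon or swapoff between the reads could shift the rows. The row-count
check catches only some of those cases.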

> > Note:
> > A special value of -1 means the swap device is completely
> > excluded from use by this cgroup. Unlike the global swap
> > priority, where negative values simply lower the priority,
> > setting -1 here disables allocation from that device for the
> > current cgroup only.
>
> The divergence from the global semantics is little bit confusing.
> There should better be a special value (like 'disabled') in the interface.
> And possible second special value like 'none' that denotes the default
> (for new (unconfigured) cgroups or when a new swap device is activated).
>

Thank you for your insightful comments and suggestions regarding the
default values. I was initially focused on providing numerical values
for these settings. However, using keywords like "none" and
"disabled" for default values makes the semantics much more natural
and user-friendly.

Based on your feedback and the cgroup-v2.html documentation on default
values, I propose the following semantics:

none: The device's priority follows the global swap priority.
Note that for devices with a negative global priority, this means
following the NUMA auto-binding rules rather than applying the
negative value directly.

disabled: This keyword explicitly excludes the swap device
from use by this cgroup.

Here's how these semantics would translate into usage:

echo "default none" > memory.swap.priority or
echo "none" > memory.swap.priority:
* Any swap device activated with swapon follows the global swap
priority in this cgroup.

echo "default disabled" > memory.swap.priority or
echo "default" > memory.swap.priority:
* Any swap device activated with swapon is excluded from
allocation within this cgroup.

echo "<id> none" > memory.swap.priority:
* The specified swap device will follow its global swap priority.

echo "<id> disabled" > memory.swap.priority:
* The specified swap device will be excluded from allocation for
this cgroup.

echo "<id> <prio>" > memory.swap.priority:
* This sets a specific priority for the specified swap device.
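
The accepted forms above can be summarized as a small parser. This is
only an illustrative Python sketch of the proposed syntax, not the
kernel implementation: the function name and the return shape are
hypothetical, IDs are assumed to be plain integers, and rejecting a
numeric value after "default" is my assumption, since only "none" and
"disabled" are listed for it.

```python
def parse_priority_line(line):
    """Parse one line written to memory.swap.priority.

    Returns (target, setting): target is "default" or an integer swap ID;
    setting is "none", "disabled", or an integer priority.
    """
    tokens = line.split()
    # Bare keywords are shorthand for the default forms listed above.
    if tokens == ["none"]:
        return ("default", "none")
    if tokens == ["default"]:
        return ("default", "disabled")
    if len(tokens) != 2:
        raise ValueError(f"unrecognized line: {line!r}")

    target, value = tokens
    if target != "default":
        target = int(target)  # assumption: IDs are plain integers
    if value in ("none", "disabled"):
        return (target, value)
    if target == "default":
        # Assumption: the default entry takes a keyword, not a number.
        raise ValueError("default accepts only 'none' or 'disabled'")
    return (target, int(value))


if __name__ == "__main__":
    for text in ["none", "default disabled", "1 none", "2 disabled", "1 10"]:
        print(text, "->", parse_priority_line(text))
```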

> ...
> > In this case:
> > - If no cgroup sets any configuration, the output matches the
> > global `swapon` priority.
> > - If an ancestor has a configuration, the child inherits it
> > and ignores its own setting.
>
> The child's priority could be capped by ancestors' instead of wholy
> overwritten? (So that remains some effect both.)

Regarding the child's priority being capped or refined by its
ancestors' settings: I've considered letting a child's own settings
take effect when its sorted priority order is consistent with the
parent's and its swap devices are a subset of the parent's. Here's a
visual representation of how that might work:

           +-----------------------+
           |     Parent cgroup     |
           |   (Swaps: A, B, C)    |
           +-----------+-----------+
                       |
                       |  (Child applies settings to its own children)
                       v
           +-----------+-----------+
           |     Child cgroup      |
           |     (Swaps: B, C)     |
           |  (B & C resolved by   |
           |   child's settings)   |
           +-----------+-----------+
                       |
           +-----------+------------+
           |                        |
           v                        v
+----------+----------+  +----------+----------+
|  Grandchild cgroup  |  | Grandchild 2 cgroup |
|     (Swaps: C)      |  |     (Swaps: A)      |
|  (C resolved by     |  |  (A not in B,C;     |
|   grandchild's      |  |   resolved by       |
|   child's settings) |  |   child's settings) |
+---------------------+  +---------------------+
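
As a rough illustration of the rule sketched in the diagram, the check
could look like the following Python. This is one possible reading, not
a settled design: I assume each cgroup's configuration is a mapping of
swap ID to priority, that "consistent order" means no pair of devices
is ranked in the opposite order to the parent, and that a cgroup falls
back to its parent's effective settings otherwise. All names are
hypothetical.

```python
from itertools import combinations


def child_settings_apply(parent, child):
    """Return True if the child's own settings may take effect."""
    devices = list(child)
    # The child's devices must be a subset of the parent's.
    if not set(devices) <= set(parent):
        return False
    # No pair of devices may be ordered oppositely to the parent.
    for a, b in combinations(devices, 2):
        if (child[a] - child[b]) * (parent[a] - parent[b]) < 0:
            return False
    return True


def resolve(parent_effective, own):
    """Pick the effective settings for a cgroup given its parent's."""
    if own and child_settings_apply(parent_effective, own):
        return own
    return parent_effective


if __name__ == "__main__":
    parent = {"A": 10, "B": 5, "C": 1}
    child = resolve(parent, {"B": 7, "C": 2})   # subset, same order
    grandchild = resolve(child, {"C": 3})       # subset of B, C
    grandchild2 = resolve(child, {"A": 4})      # A is not in B, C
    print(child, grandchild, grandchild2)
```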

However, this feature isn't required for our immediate use case, and
it adds notable complexity to the implementation. I suggest we
consider it as a next step if the current feature is merged into the
kernel and sees wider adoption, or if further use cases or
requirements emerge.

Best regards,
Youngjun Park