[PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

From: Chris Li
Date: Fri May 24 2024 - 13:17:37 EST


This is the short term solutiolns "swap cluster order" listed
in my "Swap Abstraction" discussion slice 8 in the recent
LSF/MM conference.

When commit 845982eb264bc "mm: swap: allow storage of all mTHP
orders" is introduced, it only allocates the mTHP swap entries
from new empty cluster list. That works well for PMD size THP,
but it has a serius fragmentation issue reported by Barry.

https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/

The mTHP allocation failure rate raises to almost 100% after a few
hours in Barry's test run.

The reason is that all the empty cluster has been exhausted while
there are planty of free swap entries to in the cluster that is
not 100% free.

Address this by remember the swap allocation order in the cluster.
Keep track of the per order non full cluster list for later allocation.

This greatly improve the sucess rate of the mTHP swap allocation.
While I am still waiting for Barry's test result. I paste Kairui's test
result here:

I'm able to reproduce such an issue with a simple script (enabling all order of mthp):

modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0

rmdir /sys/fs/cgroup/benchmark
mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark
echo 8G > memory.max
echo $$ > cgroup.procs

memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &

/usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
--ratio 1:0 --pipeline 8 -d 1024

Before:
Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
After:
Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74

And the fallback ratio dropped by a lot:
Before:
hugepages-32kB/stats/anon_swpout_fallback:15997
hugepages-32kB/stats/anon_swpout:18712
hugepages-512kB/stats/anon_swpout_fallback:192
hugepages-512kB/stats/anon_swpout:0
hugepages-2048kB/stats/anon_swpout_fallback:2
hugepages-2048kB/stats/anon_swpout:0
hugepages-1024kB/stats/anon_swpout_fallback:0
hugepages-1024kB/stats/anon_swpout:0
hugepages-64kB/stats/anon_swpout_fallback:18246
hugepages-64kB/stats/anon_swpout:17644
hugepages-16kB/stats/anon_swpout_fallback:13701
hugepages-16kB/stats/anon_swpout:18234
hugepages-256kB/stats/anon_swpout_fallback:8642
hugepages-256kB/stats/anon_swpout:93
hugepages-128kB/stats/anon_swpout_fallback:21497
hugepages-128kB/stats/anon_swpout:7596

(Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)

After:
hugepages-32kB/stats/swpout:34445
hugepages-32kB/stats/swpout_fallback:0
hugepages-512kB/stats/swpout:1
hugepages-512kB/stats/swpout_fallback:134
hugepages-2048kB/stats/swpout:1
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:6
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35495
hugepages-64kB/stats/swpout_fallback:0
hugepages-16kB/stats/swpout:32441
hugepages-16kB/stats/swpout_fallback:0
hugepages-256kB/stats/swpout:2223
hugepages-256kB/stats/swpout_fallback:6278
hugepages-128kB/stats/swpout:29136
hugepages-128kB/stats/swpout_fallback:52

Reported-by: Barry Song <21cnbao@xxxxxxxxx>
Tested-by: Kairui Song <kasong@xxxxxxxxxxx>
Signed-off-by: Chris Li <chrisl@xxxxxxxxxx>
---
Chris Li (2):
mm: swap: swap cluster switch to double link list
mm: swap: mTHP allocate swap entries from nonfull list

include/linux/swap.h | 18 ++--
mm/swapfile.c | 252 +++++++++++++++++----------------------------------
2 files changed, 93 insertions(+), 177 deletions(-)
---
base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
change-id: 20240523-swap-allocator-1534c480ece4

Best regards,
--
Chris Li <chrisl@xxxxxxxxxx>