Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

From: Barry Song
Date: Mon May 27 2024 - 23:07:26 EST


On Sat, May 25, 2024 at 5:17 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> This is the short-term solution "swap cluster order" listed
> on slide 8 of my "Swap Abstraction" discussion at the recent
> LSF/MM conference.
>
> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" was introduced, it only allocated the mTHP swap entries
> from the new empty cluster list. That works well for PMD size THP,
> but it has a serious fragmentation issue reported by Barry.
>
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/
>
> The mTHP allocation failure rate rises to almost 100% after a few
> hours in Barry's test run.
>
> The reason is that all the empty clusters have been exhausted while
> there are plenty of free swap entries in the clusters that are
> not 100% free.
>
> Address this by remembering the swap allocation order in the cluster.
> Keep track of the per-order non-full cluster list for later allocation.
>
> This greatly improves the success rate of the mTHP swap allocation.
> While I am still waiting for Barry's test results, I am pasting Kairui's test

Hi Chris,

Attached are the test results from a real phone using order-4 mTHP. The
results seem better overall, but after 7 hours, especially once the swap
device becomes full (soon after which some apps are killed to free memory
and swap), the fallback ratio still reaches 100%.
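
For reference, a minimal sketch of how the per-size fallback ratio can be
derived from the swpout counters. It assumes the ratio is defined as
swpout_fallback / (swpout + swpout_fallback) and that the input lines have
the "hugepages-<size>kB/stats/<counter>:<value>" form shown in the quoted
results. The function name fallback_ratio is hypothetical, not part of any
existing tool.

```shell
#!/bin/sh
# Hypothetical helper: read "hugepages-<size>kB/stats/<counter>:<value>"
# lines on stdin and print one "<size> <ratio>%" line per mTHP size.
# Assumption: ratio = swpout_fallback / (swpout + swpout_fallback).
# Both the older "anon_swpout*" and the newer "swpout*" names are accepted.
fallback_ratio() {
    awk -F: '
    {
        # $1 is the counter path, $2 is the counter value.
        n = split($1, path, "/")
        size = path[1]
        counter = path[n]
        sub(/^anon_/, "", counter)      # normalize the older counter names
        if (counter == "swpout")
            ok[size] = $2
        else if (counter == "swpout_fallback")
            fb[size] = $2
        seen[size] = 1
    }
    END {
        for (size in seen) {
            total = ok[size] + fb[size]
            if (total > 0)
                printf "%s %.1f%%\n", size, 100 * fb[size] / total
            else
                printf "%s n/a\n", size   # no swap-out attempts recorded
        }
    }' | sort
}
```

One way to feed it, assuming the current directory is
/sys/kernel/mm/transparent_hugepage, is
"grep . hugepages-*/stats/*swpout* | fallback_ratio". Sizes with no
recorded swap-out attempts are reported as "n/a" rather than 0%.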

I haven't debugged this, but my guess is that a cluster's order can shift
between order 4 and order 0. Sometimes all the clusters shift to order 0,
and they can hardly get back to order 4.

> result here:
>
> I'm able to reproduce such an issue with a simple script (enabling all orders of mTHP):
>
> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
> swapoff -a
> mkswap /dev/ram0
> swapon /dev/ram0
>
> rmdir /sys/fs/cgroup/benchmark
> mkdir -p /sys/fs/cgroup/benchmark
> cd /sys/fs/cgroup/benchmark
> echo 8G > memory.max
> echo $$ > cgroup.procs
>
> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
>
> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
> -P memcache_binary -n allkeys --key-minimum=1 \
> --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
> --ratio 1:0 --pipeline 8 -d 1024
>
> Before:
> Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
> After:
> Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74
>
> And the fallback ratio dropped by a lot:
> Before:
> hugepages-32kB/stats/anon_swpout_fallback:15997
> hugepages-32kB/stats/anon_swpout:18712
> hugepages-512kB/stats/anon_swpout_fallback:192
> hugepages-512kB/stats/anon_swpout:0
> hugepages-2048kB/stats/anon_swpout_fallback:2
> hugepages-2048kB/stats/anon_swpout:0
> hugepages-1024kB/stats/anon_swpout_fallback:0
> hugepages-1024kB/stats/anon_swpout:0
> hugepages-64kB/stats/anon_swpout_fallback:18246
> hugepages-64kB/stats/anon_swpout:17644
> hugepages-16kB/stats/anon_swpout_fallback:13701
> hugepages-16kB/stats/anon_swpout:18234
> hugepages-256kB/stats/anon_swpout_fallback:8642
> hugepages-256kB/stats/anon_swpout:93
> hugepages-128kB/stats/anon_swpout_fallback:21497
> hugepages-128kB/stats/anon_swpout:7596
>
> (Still collecting more data. The successful swpouts mostly happened early; then the fallback count began to increase, reaching a nearly 100% failure rate.)
>
> After:
> hugepages-32kB/stats/swpout:34445
> hugepages-32kB/stats/swpout_fallback:0
> hugepages-512kB/stats/swpout:1
> hugepages-512kB/stats/swpout_fallback:134
> hugepages-2048kB/stats/swpout:1
> hugepages-2048kB/stats/swpout_fallback:1
> hugepages-1024kB/stats/swpout:6
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:35495
> hugepages-64kB/stats/swpout_fallback:0
> hugepages-16kB/stats/swpout:32441
> hugepages-16kB/stats/swpout_fallback:0
> hugepages-256kB/stats/swpout:2223
> hugepages-256kB/stats/swpout_fallback:6278
> hugepages-128kB/stats/swpout:29136
> hugepages-128kB/stats/swpout_fallback:52
>
> Reported-by: Barry Song <21cnbao@xxxxxxxxx>
> Tested-by: Kairui Song <kasong@xxxxxxxxxxx>
> Signed-off-by: Chris Li <chrisl@xxxxxxxxxx>
> ---
> Chris Li (2):
> mm: swap: swap cluster switch to double link list
> mm: swap: mTHP allocate swap entries from nonfull list
>
> include/linux/swap.h | 18 ++--
> mm/swapfile.c | 252 +++++++++++++++++----------------------------------
> 2 files changed, 93 insertions(+), 177 deletions(-)
> ---
> base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
> change-id: 20240523-swap-allocator-1534c480ece4
>
> Best regards,
> --
> Chris Li <chrisl@xxxxxxxxxx>
>

Thanks
Barry

Attachment: chris-swap-patch.png
Description: PNG image