Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

From: Chris Li
Date: Thu May 30 2024 - 14:32:16 EST


On Thu, May 30, 2024 at 1:08 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Thu, May 30, 2024 at 10:54 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >
> > Chris Li <chrisl@xxxxxxxxxx> writes:
> >
> > > Hi Ying,
> > >
> > > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> > >>
> > >> Chris Li <chrisl@xxxxxxxxxx> writes:
> > >>
> > >> > I am spinning a new version of this series to address two issues
> > >> > found in it:
> > >> >
> > >> > 1) Oppo discovered a bug in the following line:
> > >> > + ci = si->cluster_info + tmp;
> > >> > It should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> > >> > That is a serious bug but trivial to fix.
> > >> >
> > >> > 2) Order 0 allocations currently scan swap_map[] blindly,
> > >> > disregarding cluster->order.
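
For 1), a minimal sketch of the fix (using the names from the quoted
line; tmp is an offset in swap slot units, while cluster_info[] is
indexed per cluster):

        struct swap_cluster_info *ci;

        /* buggy: tmp counts swap slots, but cluster_info[] is
         * indexed by cluster, so this lands on the wrong cluster */
        ci = si->cluster_info + tmp;

        /* fixed: scale the slot offset down to a cluster index */
        ci = si->cluster_info + tmp / SWAPFILE_CLUSTER;
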
> > >>
> > >> IIUC, now, we scan swap_map[] only if
> > >> list_empty(&si->free_clusters) && list_empty(&si->nonfull_clusters[order]).
> > >> That is, if you don't run low on free swap space, you will not do that.
> > >
> > > You can still have free swap space in order 0 clusters while order 4
> > > has run out of free_clusters and nonfull_clusters[order]. For Android
> > > that is a common case.
> >
> > When we fail to allocate order 4, we will fall back to order 0. We
> > still don't need to scan swap_map[]. But after looking at your reply
> > below, I realized that the swap space is almost full most of the time
> > in your cases. Then, it's possible that we run into scanning
> > swap_map[]: list_empty(&si->free_clusters) &&
> > list_empty(&si->nonfull_clusters[order]) will become true if we put
> > too many clusters in si->percpu_cluster. So, if we want to avoid
> > scanning swap_map[], we can stop adding clusters to si->percpu_cluster
> > when swap space runs low. And maybe take clusters out of
> > si->percpu_cluster sometimes.
>
> Stopping additions when space runs low seems too late; there could
> still be a free cluster stuck on a CPU, never getting scanned, right?

Only a handful of free clusters can be stuck on CPUs. Preventing low
order swap allocations from polluting high order clusters is the more
urgent problem.
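
Concretely, the order 0 scan path could check the cluster's order
before taking an offset. A rough sketch of the kind of guard I mean,
with an illustrative helper name (not from the actual patch):

        /* Hypothetical guard for the order 0 scan path: only use
         * offsets whose cluster is an order 0 cluster, so order 0
         * allocations do not pollute clusters holding higher orders. */
        static bool offset_usable_for_order0(struct swap_info_struct *si,
                                             unsigned long offset)
        {
                struct swap_cluster_info *ci;

                ci = si->cluster_info + offset / SWAPFILE_CLUSTER;
                return ci->order == 0;
        }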

>
> > Another issue is that a cluster on nonfull_clusters[order1] cannot be
> > used for nonfull_clusters[order2]. By definition, we should not fail
> > order 0 allocations, so we need to steal from
> > nonfull_clusters[order > 0] for order 0 allocations. This can avoid
> > scanning swap_map[] too. It may not be perfect, but it is the simplest
> > first-step implementation. You can optimize based on it further.
>
> This can be extended to allow any order < MAX_ORDER to steal from a
> higher order, though that might increase fragmentation.

Stealing from a higher order is a bad thing, because the value of this
allocator is being able to allocate at higher orders. Going from high
order to low order is always trivial; going from low to high is
impossible. See the other email about having a "knob" to reserve some
swap space for high order allocations. That is not perfect, but it is
more useful.
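
A crude sketch of such a knob, purely illustrative (may_take_free_cluster,
si->free_cluster_count and swap_high_order_reserve are made-up names,
not from the series):

        /* Hypothetical reserve knob: once the number of free clusters
         * drops to the reserve, stop handing free clusters to order 0
         * and keep them for high order allocations. */
        static bool may_take_free_cluster(struct swap_info_struct *si,
                                          int order)
        {
                /* high order allocations may take any free cluster */
                if (order > 0)
                        return !list_empty(&si->free_clusters);

                /* order 0 must leave the reserve untouched */
                return si->free_cluster_count > swap_high_order_reserve;
        }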

>
> So this is looking more and more like a buddy allocator, and that
> should be the long term solution.
>
In Barry's test case, there is a huge swing between order 0 and order 4
allocations caused by the low memory killer. Apps get killed, and it
takes a while for an app to relaunch and swap out high order entries. A
buddy allocator will be of limited help there, because once a cluster
has been used for order 0, fragmentation will prevent higher order
allocations from it. What we really need is a way to swap out large
folios using discontiguous swap entries. That is the longer term
solution for Barry's usage situation.

Chris