Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
From: Chris Li
Date: Fri Jul 26 2024 - 00:51:15 EST
On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> > If the freeing of swap entries follows a random distribution, you
> > need 16 contiguous swap entries that are free at the same time, at
> > a 16-aligned base location. The total amount of order-4 free swap
> > space, added up, is much lower than the order-0 allocatable swap
> > space. If a single entry being free has 50% probability (swapfile
> > half full), then the probability that 16 contiguous entries are all
> > free is 0.5^16, about 1.5E-5. If the swapfile is 80% full, that
> > number drops to about 6.5E-12.
>
> This depends on the workload. Quite a few workloads will show some
> degree of spatial locality. For a workload with no spatial locality
> at all, as above, mTHP may not be a good choice in the first place.
The fragmentation comes from the order-0 entries, not from mTHP. mTHP
has its own valid use cases, which should be kept separate from how
the order-0 entries are used. That is why I consider this kind of
strategy to work only in the lucky case. I would much prefer a
strategy that is guaranteed to work and does not depend on luck.
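(For reference, a minimal user-space sketch of the back-of-the-envelope
estimate quoted above; it only restates the math, with the 50% and 80%
fill levels taken from the quote.)

#include <math.h>
#include <stdio.h>

/*
 * Probability that one aligned run of 16 swap entries is entirely
 * free, assuming every entry is free independently with probability
 * (1 - fill), i.e. no spatial locality at all.
 */
int main(void)
{
	double fill[] = { 0.5, 0.8 };	/* swapfile 50% and 80% full */

	for (int i = 0; i < 2; i++) {
		double p_free = 1.0 - fill[i];
		double p_order4 = pow(p_free, 16);	/* 16 contiguous entries */

		printf("swapfile %.0f%% full: P(aligned order-4 run free) = %.1e\n",
		       100.0 * fill[i], p_order4);
	}
	return 0;
}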
> >> - Order-4 pages need to be swapped out, but not enough order-4
> >> non-full clusters are available.
> >
> > Exactly.
> >
> >>
> >> So, we need a way to migrate non-full clusters among orders to adjust to
> >> the various situations automatically.
> >
> > There is no easy way to migrate swap entries to different locations.
> > That is why I would like to have discontiguous swap entry allocation
> > for mTHP.
>
> We suggest migrating non-full swap clusters among different lists,
> not swap entries.
Then you have the downside of reducing the total number of high-order
clusters. Statistically, it is much easier to fragment a cluster than
to anti-fragment one. The orders of clusters have a natural tendency
to move down rather than up, given a long enough period of random
access. We will likely run out of high-order clusters in the long run
if we don't have any separation of orders.
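(To make the "separation of orders" concrete, here is a simplified
sketch of per-order non-full lists; the type and field names are
illustrative only, not necessarily what the patch series uses.)

#include <linux/list.h>

#define SWAP_NR_ORDERS_SKETCH	5	/* orders 0..4, for illustration */

/*
 * Keep non-full clusters on per-order lists so that order-0
 * allocations do not consume clusters that could still serve
 * high-order (e.g. order-4) allocations.
 */
struct swap_info_sketch {
	struct list_head free_clusters;		/* completely free clusters */
	struct list_head nonfull_clusters[SWAP_NR_ORDERS_SKETCH];
						/* partially used, one list per order */
	struct list_head full_clusters;		/* no free entries left */
};

/*
 * An order-N allocation tries nonfull_clusters[N] first, then falls
 * back to a completely free cluster; it never steals from the
 * non-full lists of other orders, which is the separation above.
 */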
> >> But yes, data is needed for any performance-related change.
>
> BTW: I think non-full cluster isn't a good name. Partial cluster is
> much better and follows the same convention as partial slab.
I am not opposed to it. The only reason I am holding off on the rename
is that there are patches from Kairui that I am testing which depend
on it. Let's finish up the V5 patch with the swap cache reclaim code
path, then do the renaming as one batch job. We actually have more
than one list holding clusters that are partially full. That helps
reduce repeated scans of a cluster that is not full but still cannot
satisfy an allocation for this order. Naming just one of them
"partial" is not precise either, because the other lists are also
partially full. We'd better give them precise meanings systematically.
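(Purely to illustrate the "more than one list" point, extending the
sketch above; "frag_clusters" and the function name here are
hypothetical, not the names used in the patches.)

/*
 * When a cluster on the per-order non-full list still has free
 * entries but no aligned order-sized run left, move it to a separate
 * per-order list so later allocations of that order do not rescan it.
 */
struct swap_cluster_sketch {
	struct list_head list;
	/* ... free-entry accounting ... */
};

static void demote_cluster_sketch(struct list_head *frag_list,
				  struct swap_cluster_sketch *ci)
{
	/* partially full, but useless for this order */
	list_move_tail(&ci->list, frag_list);
}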
Chris