Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list

From: Huang, Ying
Date: Thu May 30 2024 - 22:05:44 EST


Chris Li <chrisl@xxxxxxxxxx> writes:

> On Wed, May 29, 2024 at 1:48 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Chris Li <chrisl@xxxxxxxxxx> writes:
>>
>> > Previously, the swap cluster used a cluster index as a pointer
>> > to construct a custom single link list type "swap_cluster_list".
>> > The next cluster pointer is shared with the cluster->count.
>> > The assumption is that only the free cluster needs to be put
>> > on the list.
>> >
>> > That assumption is not true for mTHP allocators any more.
>>
>> I think that the words aren't correct here. mTHP swap entry allocators
>> can work with current cluster definition.
>
> The current behavior is very problematic though:
>
> If we only allocate and free order 4 swap entries, nothing else. After
> a while, the free cluster will be used up, the swap entry allocation
> will fail even though there is a lot of swap space left.

The original behavior doesn't work well for order-0 allocation too.
percpu_cluster and quick (cluster) scan path cannot be used for
fragmented swap devices.

>> > Need to track the non full cluster on the list as well. Move the
>> > current cluster single link list into standard double link list.
>>
>> It's an optimization to track non-full cluster with a list.
>>
>> I understand that you want to change cluster list definition. I just
>
> In my mind, I was changing the list implementation so it can be
> tracked non free cluster as well.
>
>> feel the wording isn't accurate.
>
> Help me improve it. I am happy to adjust the wording in V2, you can
> provide more feedback then.

I suggest you to focus on improvement. The original implementation
hasn't the assumption that it's the best or perfect. It improves from
its base and you can continue to improve it for more situations.
Describe the situation where current implementation doesn't performance
well and how do you improve it. Better with the cost.

>>
>> > Remove the cluster getter/setter for accessing the cluster
>> > struct member. Move the cluster locking in the caller function
>> > rather than the getter/setter function. That way the locking can
>> > protect more than one member, e.g. cluster->flag.
>>
>> Sorry, I don't understand the locking in above words. I don't find that
>> we lock/unlock in the original getter/setter functions. I found that
>> the cluster locking rule for cluster list is changed. Better to make
>> this explicit.
>
> The original cluster single link list add/remove will require si->lock
> protection as well, because the list head and tail pointer are outside
> of the cluster pointer.
> In this regard, the cluster double link list locking rule is very
> similar. Yes, I move the list_del() outside of the cluster lock, is
> that what you are referring to as the locking change?

In the original implementation, ci->lock is held when changing ci->data
(in fact next here). Now, you don't need the ci->lock. This is a
locking rule change, I suggest you to make it explicit in change log and
comments.

>> > Change cluster code to use "struct swap_cluster_info *" to
>> > reference the cluster rather than by using index. That is more
>> > consistent with the list manipulation. It avoids the repeat
>> > adding index to the cluser_info. The code is easier to understand.
>> >
>> > Remove the cluster next pointer is NULL flag, the double link
>> > list can handle the empty list pretty well.
>> >
>> > The "swap_cluster_info" struct is two pointer bigger, because
>> > 512 swap entries share one swap struct, it has very little impact
>> > on the average memory usage per swap entry. Other than the list
>> > conversion, there is no real function change in this patch.
>>
>> On 64bit platform, the size of swap_cluster_info increases from 8 bytes
>> to 24 bytes. For a 1TB swap device, the memory usage will increase from
>> 4MB to 12MB. This looks OK for me.
>
> Will add the size change calculation in V2 and have you review it again.
>
>>
>> Another choice is to use a customized double linked list using "unsigned
>> int" as pointer to cluster. That will reduce the size of cluster to 16
>> bytes. But it may be not necessary to do that.
>
> We can always do that as a follow up step to optimize the 24 byte to
> 16 byte, at the price of more code complicity.
> The trick part is the link list head, it is not part of the cluster
> array, it does not have an index, and will need a special handle for
> that.

In theory, you can define a "struct list_u32_head" and a set of
list_u32_* functions. But I don't find that it's necessary to do that.

>>
>> Anyway, I think that it's better to add more calculation in changelog
>> for memory usage increment.
>
> Sure, I will adjust the commit message in V2.
>
> Chris
>
>>
>> > ---
>> > include/linux/swap.h | 14 ++--
>> > mm/swapfile.c | 231 ++++++++++++++-------------------------------------
>> > 2 files changed, 68 insertions(+), 177 deletions(-)
>> >
>> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > index 11c53692f65f..0d3906eff3c9 100644
>> > --- a/include/linux/swap.h
>> > +++ b/include/linux/swap.h
>> > @@ -254,11 +254,12 @@ struct swap_cluster_info {
>> > * elements correspond to the swap
>> > * cluster
>> > */
>> > - unsigned int data:24;
>> > + unsigned int count:16;
>> > unsigned int flags:8;
>>
>> If we use 16bits and 8 bits in bit fields, why not just use u8 and u16
>> instead?
> Not sure about the

?

u16 count;
u8 flags;

>>
>> > + struct list_head next;
>>
>> "next" isn't a good naming because prev pointer is in list_head too.
>> The common naming is "list".
>
> Sure, I can change it to "list".
>
>>
>> Need to revise comments for swap_cluster_info.lock and add the locking
>> rule comments for swap_cluster_info.next.
>
> Will do.
>
>>
>> > };
>> > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>> > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
>> > +
>> >

[snip]

--
Best Regards,
Huang, Ying