Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin

From: Huang, Ying
Date: Wed Mar 27 2024 - 04:18:35 EST

Next message: Varadarajan Narayanan: "[PATCH v3 0/3] Add interconnect driver for IPQ9574 SoC"
Previous message: Tony Lindgren: "[PATCH 5/5] bus: ti-sysc: Drop legacy idle quirk handling"
In reply to: Kairui Song: "Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin"
Next in thread: Barry Song: "Re: [RFC PATCH 10/10] mm/swap: optimize synchronous swapin"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Kairui Song <ryncsn@xxxxxxxxx> writes:

> On Wed, Mar 27, 2024 at 2:49 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Kairui Song <ryncsn@xxxxxxxxx> writes:
>>
>> > On Wed, Mar 27, 2024 at 2:24 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >>
>> >> Kairui Song <ryncsn@xxxxxxxxx> writes:
>> >>
>> >> > From: Kairui Song <kasong@xxxxxxxxxxx>
>> >> >
>> >> > Interestingly the major performance overhead of synchronous swapin is
>> >> > actually from the workingset nodes update, because synchronous swapin
>> >>
>> >> If it's the major overhead, why not make it the first optimization?
>> >
>> > This performance issue became much more obvious after doing other
>> > optimizations, and other optimizations are for general swapin not only
>> > for synchronous swapin, that's also how I optimized things step by
>> > step, so I kept my patch order...
>> >
>> > And it is easier to do this after Patch 8/10 which introduces the new
>> > interface for swap cache.
>> >
>> >>
>> >> > keeps adding single folios into an xa_node, so the node is no longer
>> >> > a shadow node and has to be removed from shadow_nodes; the folio is
>> >> > then removed very shortly, making the node a shadow node again,
>> >> > so it has to be added back to shadow_nodes.
>> >>
>> >> The folio is removed only if should_try_to_free_swap() returns true?
>> >>
>> >> > Mark synchronous swapin folios with a special bit in the swap entry
>> >> > embedded in folio->swap, as we still have some usable bits there. Skip
>> >> > the workingset node update on insertion of such a folio because it will
>> >> > be removed very quickly, and the removal will trigger the update,
>> >> > ensuring the workingset info is eventually consistent.
>> >>
>> >> Is this safe? Is it possible for the shadow node to be reclaimed after
>> >> the folio is added into the node and before it is removed?
>> >
>> > If a xa node contains any non-shadow entry, it can't be reclaimed,
>> > shadow_lru_isolate will check and skip such nodes in case of race.
>>
>> In shadow_lru_isolate(),
>>
>>         /*
>>          * The nodes should only contain one or more shadow entries,
>>          * no pages, so we expect to be able to remove them all and
>>          * delete and free the empty node afterwards.
>>          */
>>         if (WARN_ON_ONCE(!node->nr_values))
>>                 goto out_invalid;
>>         if (WARN_ON_ONCE(node->count != node->nr_values))
>>                 goto out_invalid;
>>
>> So, this isn't considered normal and will cause warning now.
>
> Yes, I added an exception in this patch:
> -       if (WARN_ON_ONCE(node->count != node->nr_values))
> +       if (WARN_ON_ONCE(node->count != node->nr_values && mapping->host != NULL))
>
> The code is not a good final solution, but the idea might not be that
> bad, list_lru provides many operations like LRU_ROTATE, we can even
> lazy remove all the nodes as a general optimization, or add a
> threshold for adding/removing a node from LRU.

We can compare different solutions. For this one, we still need to deal
with the cases where the folio isn't removed from the swap cache, that
is, should_try_to_free_swap() returns false.

>>
>> >>
>> >> If so, we may consider some other methods. Make shadow_nodes per-cpu?
>> >
>> > That's also an alternative solution if there are other risks.
>>
>> This appears to be a more general and cleaner optimization.
>
> I'm not sure whether synchronization between CPUs will add more burden,
> because shadow nodes are globally shared and one node can be referenced
> by multiple CPUs. I can try to see if this is doable. Maybe a
> per-cpu batch is better but synchronization might still be an issue.

Yes. With per-CPU shadow_nodes, we need to find the list that a shadow
node is on from the node itself. That has some overhead.

If lock contention on the list_lru lock is the root cause, we can use hashed
shadow node lists. That can reduce lock contention effectively.
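
To make the hashed-list idea concrete, here is a minimal userspace sketch,
not kernel code. Everything in it is an assumption made for illustration:
struct shadow_node stands in for the xa_node list linkage, a pthread mutex
stands in for the list_lru lock, and the names shadow_bucket,
shadow_node_add() and shadow_node_del() are hypothetical. The point is only
that the bucket is derived from the node address, so no per-node "which list
am I on" field is needed, and two CPUs touching different nodes usually take
different locks.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define SHADOW_HASH_BITS 4
#define SHADOW_HASH_SIZE (1u << SHADOW_HASH_BITS)

/* Hypothetical stand-in for the list linkage inside an xa_node. */
struct shadow_node {
	struct shadow_node *prev, *next;	/* both NULL when not on a list */
};

/* One hash bucket: its own lock, a circular list sentinel, and a length. */
struct shadow_bucket {
	pthread_mutex_t lock;
	struct shadow_node head;
	unsigned long nr_items;
};

static struct shadow_bucket shadow_buckets[SHADOW_HASH_SIZE];

static void shadow_buckets_init(void)
{
	for (unsigned int i = 0; i < SHADOW_HASH_SIZE; i++) {
		struct shadow_bucket *b = &shadow_buckets[i];

		pthread_mutex_init(&b->lock, NULL);
		b->head.prev = b->head.next = &b->head;
		b->nr_items = 0;
	}
}

/* Map a node to its bucket with a multiplicative hash of its address. */
static struct shadow_bucket *shadow_bucket_of(const struct shadow_node *node)
{
	uint64_t h = (uint64_t)(uintptr_t)node * 0x9E3779B97F4A7C15ull;

	return &shadow_buckets[h >> (64 - SHADOW_HASH_BITS)];
}

/* Add the node to the tail of its bucket; returns 1 if added, 0 if already listed. */
static int shadow_node_add(struct shadow_node *node)
{
	struct shadow_bucket *b = shadow_bucket_of(node);
	int added = 0;

	pthread_mutex_lock(&b->lock);
	if (!node->next) {
		node->prev = b->head.prev;
		node->next = &b->head;
		b->head.prev->next = node;
		b->head.prev = node;
		b->nr_items++;
		added = 1;
	}
	pthread_mutex_unlock(&b->lock);
	return added;
}

/* Remove the node from its bucket; returns 1 if removed, 0 if it was not listed. */
static int shadow_node_del(struct shadow_node *node)
{
	struct shadow_bucket *b = shadow_bucket_of(node);
	int removed = 0;

	pthread_mutex_lock(&b->lock);
	if (node->next) {
		node->prev->next = node->next;
		node->next->prev = node->prev;
		node->prev = node->next = NULL;
		assert(b->nr_items > 0);
		b->nr_items--;
		removed = 1;
	}
	pthread_mutex_unlock(&b->lock);
	return removed;
}

/* Total number of listed nodes, taking each bucket lock in turn. */
static unsigned long shadow_nodes_count(void)
{
	unsigned long total = 0;

	for (unsigned int i = 0; i < SHADOW_HASH_SIZE; i++) {
		pthread_mutex_lock(&shadow_buckets[i].lock);
		total += shadow_buckets[i].nr_items;
		pthread_mutex_unlock(&shadow_buckets[i].lock);
	}
	return total;
}
```

The trade-off in such a scheme is that reclaim has to walk all buckets
instead of one list, and strict LRU order across buckets is lost.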

--
Best Regards,
Huang, Ying
