Re: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders
From: Ryan Roberts
Date: Wed Mar 20 2024 - 08:22:33 EST
Hi Huang, Ying,
On 12/03/2024 07:51, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>
>> Multi-size THP enables performance improvements by allocating large,
>> pte-mapped folios for anonymous memory. However, I've observed that on an
>> arm64 system running a parallel workload (e.g. kernel compilation)
>> across many cores, under high memory pressure, performance regresses. This
>> is due to bottlenecking on the increased number of TLBIs caused by all the
>> extra folio splitting when the large folios are swapped out.
>>
>> Therefore, solve this regression by adding support for swapping out mTHP
>> without needing to split the folio, just like is already done for
>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>> and when the swap backing store is a non-rotating block device. These
>> are the same constraints as for the existing PMD-sized THP swap-out
>> support.
>>
>> Note that no attempt is made to swap-in (m)THP here - this is still done
>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>> prerequisite for swapping-in mTHP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries in the range
>> [1, (1 << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>> order and allocating sequentially from it until the cluster is full.
>> This ensures that we don't need to search the map and we get no
>> fragmentation due to alignment padding for different orders in the
>> cluster. If there is no current cluster for a given order, we attempt to
>> allocate a free cluster from the list. If there are no free clusters, we
>> fail the allocation, and the caller can fall back to splitting the folio
>> and allocating individual entries (as per the existing PMD-sized THP
>> fallback).
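
To make the scheme concrete, here is a minimal userspace sketch of the
allocation strategy described above (illustrative names and sizes only, not
the kernel's actual data structures):

#include <stdbool.h>
#include <stdio.h>

#define PMD_ORDER	9
#define CLUSTER_SIZE	(1u << PMD_ORDER)	/* entries per cluster */
#define NR_CLUSTERS	4

struct cluster {
	unsigned int base;	/* first swap offset covered by this cluster */
	unsigned int next;	/* next free offset within the cluster */
	bool in_use;
};

static struct cluster clusters[NR_CLUSTERS];
static struct cluster *current_cluster[PMD_ORDER + 1];	/* one per order */

static struct cluster *grab_free_cluster(void)
{
	for (int i = 0; i < NR_CLUSTERS; i++) {
		if (!clusters[i].in_use) {
			clusters[i].in_use = true;
			clusters[i].next = 0;
			return &clusters[i];
		}
	}
	return NULL;	/* no free clusters: swap is (nearly) full */
}

/*
 * Allocate (1 << order) contiguous entries. Because each cluster only
 * ever serves a single order, sequential allocation is naturally
 * aligned, so there is no padding and no map scanning. Returns -1 so
 * the caller can fall back to splitting the folio.
 */
static long alloc_swap_entries(unsigned int order)
{
	unsigned int nr = 1u << order;
	struct cluster *c = current_cluster[order];

	if (!c || c->next + nr > CLUSTER_SIZE) {
		c = grab_free_cluster();
		if (!c)
			return -1;
		current_cluster[order] = c;
	}
	c->next += nr;
	return c->base + (c->next - nr);
}

int main(void)
{
	for (int i = 0; i < NR_CLUSTERS; i++)
		clusters[i].base = i * CLUSTER_SIZE;

	printf("order-2: %ld\n", alloc_swap_entries(2));	/* 0 */
	printf("order-2: %ld\n", alloc_swap_entries(2));	/* 4, same cluster */
	printf("order-0: %ld\n", alloc_swap_entries(0));	/* 512, own cluster */
	return 0;
}
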
>>
>> The per-order current clusters are maintained per-cpu using the existing
>> infrastructure. This is done to avoid interleaving pages from different
>> tasks, which would prevent IO being batched. This is already done for
>> the order-0 allocations so we follow the same pattern.
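
(In the kernel this "current cluster per order" state lives in the existing
per-cpu structure; conceptually it just grows a per-order dimension, roughly
like the sketch below - the real field names in the patch may differ:)

struct percpu_cluster {
	/* likely offset of the next allocation, one slot per order */
	unsigned int next[PMD_ORDER + 1];
};
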
>>
>> As is done for order-0 per-cpu clusters, the scanner can now steal
>> order-0 entries from any per-cpu-per-order reserved cluster. This
>> ensures that when the swap file is getting full, space doesn't get tied
>> up in the per-cpu reserves.
>>
>> This change only modifies swap to be able to accept any order of mTHP. It
>> doesn't yet change the callers to avoid the split; that will be done in
>> separate changes.
[...]
>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>> }
>>
>> if (si->swap_map[offset]) {
>> + VM_WARN_ON(order > 0);
>> unlock_cluster(ci);
>> if (!n_ret)
>> goto scan;
>> else
>> goto done;
>> }
>> - WRITE_ONCE(si->swap_map[offset], usage);
>> - inc_cluster_info_page(si, si->cluster_info, offset);
>> + memset(si->swap_map + offset, usage, nr_pages);
>
> Should we add barrier() here to correspond to the original WRITE_ONCE()?
> unlock_cluster(ci) may be a NOP for some swap devices.
Looking at this a bit more closely, I'm not sure a barrier is needed. Even if
there is no cluster, the swap_info is still locked, so unlocking it will act as
a barrier. There are a number of other call sites that memset() si->swap_map
while holding the swap_info lock, without an explicit barrier.
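
In other words (a sketch of the ordering argument, not the literal code):

	spin_lock(&si->lock);
	...
	ci = lock_cluster(si, offset);	/* NULL if there is no cluster_info */
	memset(si->swap_map + offset, usage, nr_pages);
	unlock_cluster(ci);		/* may be a NOP */
	...
	spin_unlock(&si->lock);		/* release: publishes the memset to
					 * anyone who later takes si->lock */
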
The original commit that added the WRITE_ONCE() was worried about a race with
the read of swap_map in _swap_info_get(). But that site is now annotated with
data_race(), which suppresses the warning. And I don't believe there are any
places that read swap_map locklessly and depend upon observing ordering between
it and other state. So I think the si unlock is sufficient?
I'm not planning to add barrier() here. Let me know if you disagree.
Thanks,
Ryan
>
>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>> unlock_cluster(ci);