On 07/04/2024 07:02, Huang, Ying wrote:
David Hildenbrand <david@xxxxxxxxxx> writes:
On 03.04.24 13:40, Ryan Roberts wrote:
Multi-size THP enables performance improvements by allocating large,
pte-mapped folios for anonymous memory. However I've observed that on an
arm64 system running a parallel workload (e.g. kernel compilation)
across many cores, under high memory pressure, the speed regresses. This
is due to bottlenecking on the increased number of TLBIs added due to
all the extra folio splitting when the large folios are swapped out.
Therefore, solve this regression by adding support for swapping out
mTHP
without needing to split the folio, just like is already done for
PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
and when the swap backing store is a non-rotating block device. These
are the same constraints as for the existing PMD-sized THP swap-out
support.
Note that no attempt is made to swap-in (m)THP here - this is still
done
page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
prerequisite for swapping-in mTHP.
The main change here is to improve the swap entry allocator so that
it
can allocate any power-of-2 number of contiguous entries between [1, (1
<< PMD_ORDER)]. This is done by allocating a cluster for each distinct
order and allocating sequentially from it until the cluster is full.
This ensures that we don't need to search the map and we get no
fragmentation due to alignment padding for different orders in the
cluster. If there is no current cluster for a given order, we attempt to
allocate a free cluster from the list. If there are no free clusters, we
fail the allocation and the caller can fall back to splitting the folio
and allocates individual entries (as per existing PMD-sized THP
fallback).
The per-order current clusters are maintained per-cpu using the
existing
infrastructure. This is done to avoid interleving pages from different
tasks, which would prevent IO being batched. This is already done for
the order-0 allocations so we follow the same pattern.
As is done for order-0 per-cpu clusters, the scanner now can steal
order-0 entries from any per-cpu-per-order reserved cluster. This
ensures that when the swap file is getting full, space doesn't get tied
up in the per-cpu reserves.
This change only modifies swap to be able to accept any order
mTHP. It
doesn't change the callers to elide doing the actual split. That will be
done in separate changes.
Reviewed-by: "Huang, Ying" <ying.huang@xxxxxxxxx>
Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
---
include/linux/swap.h | 10 ++-
mm/swap_slots.c | 6 +-
mm/swapfile.c | 175 ++++++++++++++++++++++++-------------------
3 files changed, 109 insertions(+), 82 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5e1e4f5bf0cb..11c53692f65f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -268,13 +268,19 @@ struct swap_cluster_info {
*/
#define SWAP_NEXT_INVALID 0
+#ifdef CONFIG_THP_SWAP
+#define SWAP_NR_ORDERS (PMD_ORDER + 1)
+#else
+#define SWAP_NR_ORDERS 1
+#endif
+
/*
* We assign a cluster to each CPU, so each CPU can allocate swap entry from
* its own cluster and swapout sequentially. The purpose is to optimize swapout
* throughput.
*/
struct percpu_cluster {
- unsigned int next; /* Likely next allocation offset */
+ unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
};
struct swap_cluster_list {
@@ -471,7 +477,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio);
bool folio_free_swap(struct folio *folio);
void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
-extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
+extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 53abeaf1371d..13ab3b771409 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -264,7 +264,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
cache->cur = 0;
if (swap_slot_cache_active)
cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
- cache->slots, 1);
+ cache->slots, 0);
return cache->nr;
}
@@ -311,7 +311,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
if (folio_test_large(folio)) {
if (IS_ENABLED(CONFIG_THP_SWAP))
- get_swap_pages(1, &entry, folio_nr_pages(folio));
+ get_swap_pages(1, &entry, folio_order(folio));
The only comment I have is that this nr_pages -> order conversion adds
a bit of noise to this patch.
AFAIKS, it's primarily only required for "cluster->next[order]",
everything else doesn't really require the order.
I'd just have split that out into a separate patch, or simply
converted nr_pages -> order where required.
Nothing jumped at me, but I'm not an expert on that code, so I'm
mostly trusting the others ;)
The nr_pages -> order conversion replaces ilog2(nr_pages) with
(1<<order). IIUC, "<<" is a little faster than "ilog2()". And, we
don't need to worry about whether nr_pages is a power of 2. Do you
think that this makes sense?
I think that David's point was that I should just split out that change to its
own patch to aid readability? I'm happy to do that if no one objects.