Re: [PATCH v6 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()

From: David Hildenbrand
Date: Fri Apr 05 2024 - 06:13:26 EST

Next message: Honggyu Kim: "Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory"
Previous message: Przemek Kitszel: "Re: [PATCH net-next v9 4/9] slab: introduce kvmalloc_array_node() and kvcalloc_node()"
In reply to: Ryan Roberts: "[PATCH v6 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()"
Next in thread: Ryan Roberts: "[PATCH v6 3/6] mm: swap: Simplify struct percpu_cluster"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 03.04.24 13:40, Ryan Roberts wrote:

Now that we no longer have a convenient flag in the cluster to determine
if a folio is large, free_swap_and_cache() will take a reference and
lock a large folio much more often, which could lead to contention and
(e.g.) failure to split large folios, etc.

Let's solve that problem by batch freeing swap and cache with a new
function, free_swap_and_cache_nr(), to free a contiguous range of swap
entries together. This allows us to first drop a reference to each swap
slot before we try to release the cache folio. This means we only try to
release the folio once, only taking the reference and lock once - much
better than the previous 512 times for the 2M THP case.

Contiguous swap entries are gathered in zap_pte_range() and
madvise_free_pte_range() in a similar way to how present ptes are
already gathered in zap_pte_range().

While we are at it, let's simplify by converting the return type of both
functions to void. The return value was used only by zap_pte_range() to
print a bad pte, and was ignored by everyone else, so the extra
reporting wasn't exactly guaranteed. We will still get the warning with
most of the information from get_swap_device(). With the batch version,
we wouldn't know which pte was bad anyway so could print the wrong one.

Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
---
include/linux/pgtable.h | 28 ++++++++++++++
include/linux/swap.h | 12 ++++--
mm/internal.h | 48 +++++++++++++++++++++++
mm/madvise.c | 12 ++++--
mm/memory.c | 13 ++++---
mm/swapfile.c | 86 ++++++++++++++++++++++++++++++++---------
6 files changed, 167 insertions(+), 32 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a3fc8150b047..0278259f7078 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
}
#endif
+#ifndef clear_not_present_full_ptes
+/**
+ * clear_not_present_full_ptes - Clear consecutive not present PTEs.

Consecutive only in the page table or also in some other sense?

I suspect: just unrelated non-present entries of any kind (swp, nonswp) and any offset/pfn.

Consider document that.

+ * @mm: Address space the ptes represent.
+ * @addr: Address of the first pte.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over pte_clear_not_present_full().
+ *
+ * Context: The caller holds the page table lock. The PTEs are all not present.
+ * The PTEs are all in the same PMD.
+ */
+static inline void clear_not_present_full_ptes(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+ for (;;) {
+ pte_clear_not_present_full(mm, addr, ptep, full);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ }
+}
+#endif
+
#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
unsigned long address,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f6f78198f000..5737236dc3ce 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
-extern int free_swap_and_cache(swp_entry_t);
+extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
@@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct *si)
#define free_pages_and_swap_cache(pages, nr) \
release_pages((pages), (nr));

[...]

+
+/**
+ * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
+ * @start_ptep: Page table pointer for the first entry.
+ * @max_nr: The maximum number of table entries to consider.
+ * @entry: Swap entry recovered from the first table entry.
+ *
+ * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
+ * containing swap entries all with consecutive offsets and targeting the same
+ * swap type.
+ *

Likely you should document that any swp pte bits are ignored? ()

+ * max_nr must be at least one and must be limited by the caller so scanning
+ * cannot exceed a single page table.
+ *
+ * Return: the number of table entries in the batch.
+ */
+static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
+ swp_entry_t entry)
+{
+ const pte_t *end_ptep = start_ptep + max_nr;
+ unsigned long expected_offset = swp_offset(entry) + 1;
+ unsigned int expected_type = swp_type(entry);
+ pte_t *ptep = start_ptep + 1;
+
+ VM_WARN_ON(max_nr < 1);
+ VM_WARN_ON(non_swap_entry(entry));
+
+ while (ptep < end_ptep) {
+ pte_t pte = ptep_get(ptep);
+
+ if (pte_none(pte) || pte_present(pte))
+ break;
+
+ entry = pte_to_swp_entry(pte);
+
+ if (non_swap_entry(entry) ||
+ swp_type(entry) != expected_type ||
+ swp_offset(entry) != expected_offset)
+ break;
+
+ expected_offset++;
+ ptep++;
+ }
+
+ return ptep - start_ptep;
+}

Looks very clean :)

I was wondering whether we could similarly construct the expected swp PTE and only check pte_same.

expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));

.. or have a variant to increase only the swp offset for an existing pte. But non-trivial due to the arch-dependent format.

But then, we'd fail on mismatch of other swp pte bits.

On swapin, when reusing this function (likely!), we'll might to make sure that the PTE bits match as well.

See below regarding uffd-wp.

#endif /* CONFIG_MMU */
void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
diff --git a/mm/madvise.c b/mm/madvise.c
index 1f77a51baaac..070bedb4996e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
struct folio *folio;
int nr_swap = 0;
unsigned long next;
+ int nr, max_nr;
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd))
@@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
- for (; addr != end; pte++, addr += PAGE_SIZE) {
+ for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
+ nr = 1;
ptent = ptep_get(pte);
if (pte_none(ptent))
@@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
entry = pte_to_swp_entry(ptent);
if (!non_swap_entry(entry)) {
- nr_swap--;
- free_swap_and_cache(entry);
- pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ max_nr = (end - addr) / PAGE_SIZE;
+ nr = swap_pte_batch(pte, max_nr, entry);
+ nr_swap -= nr;
+ free_swap_and_cache_nr(entry, nr);
+ clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
} else if (is_hwpoison_entry(entry) ||
is_poisoned_swp_entry(entry)) {
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
diff --git a/mm/memory.c b/mm/memory.c
index 7dc6c3d9fa83..ef2968894718 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
folio_remove_rmap_pte(folio, page, vma);
folio_put(folio);
} else if (!non_swap_entry(entry)) {
- /* Genuine swap entry, hence a private anon page */
+ max_nr = (end - addr) / PAGE_SIZE;
+ nr = swap_pte_batch(pte, max_nr, entry);
+ /* Genuine swap entries, hence a private anon pages */
if (!should_zap_cows(details))
continue;
- rss[MM_SWAPENTS]--;
- if (unlikely(!free_swap_and_cache(entry)))
- print_bad_pte(vma, addr, ptent, NULL);
+ rss[MM_SWAPENTS] -= nr;
+ free_swap_and_cache_nr(entry, nr);
} else if (is_migration_entry(entry)) {
folio = pfn_swap_entry_folio(entry);
if (!should_zap_folio(details, folio))
@@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
WARN_ON_ONCE(1);
}
- pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
- zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
+ clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);

For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.

zap_install_uffd_wp_if_needed() will use the uffd-wp information in ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.

On mixture, you either lose some or place too many markers.

A simple workaround would be to disable any such batching if the VMA does have uffd-wp enabled.

+ zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
add_mm_rss_vec(mm, rss);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0d44ee2b4f9c..d059de6896c1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
/* Reclaim the swap entry if swap is getting full*/
#define TTRS_FULL 0x4
-/* returns 1 if swap entry is freed */
+/*
+ * returns number of pages in the folio that backs the swap entry. If positive,
+ * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
+ * folio was associated with the swap entry.
+ */
static int __try_to_reclaim_swap(struct swap_info_struct *si,
unsigned long offset, unsigned long flags)
{
@@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
ret = folio_free_swap(folio);
folio_unlock(folio);
}
+ ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
folio_put(folio);
return ret;
}
@@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
spin_lock(&si->lock);
/* entry was freed successfully, try to use this again */
- if (swap_was_freed)
+ if (swap_was_freed > 0)
goto checks;
goto scan; /* check next one */
}
@@ -1572,32 +1577,75 @@ bool folio_free_swap(struct folio *folio)
return true;
}
-/*
- * Free the swap entry like above, but also try to
- * free the page cache entry if it is the last user.
- */
-int free_swap_and_cache(swp_entry_t entry)

Can we have some documentation what this function expects? How does nr relate to entry?

i.e., offset range defined by [entry.offset, entry.offset + nr).

+void free_swap_and_cache_nr(swp_entry_t entry, int nr)
{
- struct swap_info_struct *p;

It might be easier to get if you do

const unsigned long start_offset = swp_offset(entry);
const unsigned long end_offset = start_offset + nr;

+ unsigned long end = swp_offset(entry) + nr;
+ unsigned int type = swp_type(entry);
+ struct swap_info_struct *si;
+ bool any_only_cache = false;
+ unsigned long offset;
unsigned char count;
if (non_swap_entry(entry))
- return 1;
+ return;
- p = get_swap_device(entry);
- if (p) {
- if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
- put_swap_device(p);
- return 0;
+ si = get_swap_device(entry);
+ if (!si)
+ return;
+
+ if (WARN_ON(end > si->max))
+ goto out;
+
+ /*
+ * First free all entries in the range.
+ */
+ for (offset = swp_offset(entry); offset < end; offset++) {
+ if (!WARN_ON(data_race(!si->swap_map[offset]))) {

Ouch "!(!)). Confusing.

I'm sure there is a better way to write that, maybe using more lines

if (data_race(si->swap_map[offset])) {
...
} else {
WARN_ON_ONCE(1);
}

+ count = __swap_entry_free(si, swp_entry(type, offset));
+ if (count == SWAP_HAS_CACHE)
+ any_only_cache = true;
}
+ }
+
+ /*
+ * Short-circuit the below loop if none of the entries had their
+ * reference drop to zero.
+ */
+ if (!any_only_cache)
+ goto out;
- count = __swap_entry_free(p, entry);
- if (count == SWAP_HAS_CACHE)
- __try_to_reclaim_swap(p, swp_offset(entry),
+ /*
+ * Now go back over the range trying to reclaim the swap cache. This is
+ * more efficient for large folios because we will only try to reclaim
+ * the swap once per folio in the common case. If we do
+ * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
+ * latter will get a reference and lock the folio for every individual
+ * page but will only succeed once the swap slot for every subpage is
+ * zero.
+ */
+ for (offset = swp_offset(entry); offset < end; offset += nr) {
+ nr = 1;
+ if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {

Here we use READ_ONCE() only, above data_race(). Hmmm.

+ /*
+ * Folios are always naturally aligned in swap so
+ * advance forward to the next boundary. Zero means no
+ * folio was found for the swap entry, so advance by 1
+ * in this case. Negative value means folio was found
+ * but could not be reclaimed. Here we can still advance
+ * to the next boundary.
+ */
+ nr = __try_to_reclaim_swap(si, offset,
TTRS_UNMAPPED | TTRS_FULL);
- put_swap_device(p);
+ if (nr == 0)
+ nr = 1;
+ else if (nr < 0)
+ nr = -nr;
+ nr = ALIGN(offset + 1, nr) - offset;
+ }

Apart from that nothing jumped at me.

--
Cheers,

David / dhildenb

Next message: Honggyu Kim: "Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory"
Previous message: Przemek Kitszel: "Re: [PATCH net-next v9 4/9] slab: introduce kvmalloc_array_node() and kvcalloc_node()"
In reply to: Ryan Roberts: "[PATCH v6 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()"
Next in thread: Ryan Roberts: "[PATCH v6 3/6] mm: swap: Simplify struct percpu_cluster"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]