Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment

From: Linus Torvalds
Date: Sun Oct 30 2022 - 21:47:55 EST


On Sun, Oct 30, 2022 at 3:47 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Anyway, this simplification patch basically means that the *next* step
> could be to just move that 'page_zap_pte_rmap()' after the TLB flush,
> and now it's trivial and no longer scary.

So here's that next patch: it's patch 4/4 in this series.

Patches 1-3 are the patches that I already sent out as one (smaller)
combined patch. I'm including them here as the whole series in case
somebody else wants to follow along with how I did the simplified
version of page_remove_rmap().

So if you already looked at the previous patch and don't have any
questions about that one, you could just apply PATCH 4/4 on top of
that one.

Or you can do the whole series with commit messages and (hopefully)
clearer individual steps to the simplified version of
page_remove_rmap().

I haven't actually tested 4/4 yet, so this is yet another of my "I
think this should work" patches.

The reason I haven't actually tested it is partly because I never
recreated the original problem Nadav reported, and partly because the
meat of patch 4/4 is just the same "encode an extra flag bit in the
low bit of the page pointer" that I _did_ test, just doing the "remove
rmap" instead of "set dirty".

In other words, I *think* this should make Nadav's test-case happy,
and avoid the warning he saw.

If somebody wants a git branch, I guess I can push that too, but it
would be a non-stable branch only for testing.

Also, it's worth noting that zap_pte_range() does this sanity test:

        if (unlikely(page_mapcount(page) < 0))
                print_bad_pte(vma, addr, ptent, page);

and that check is likely worthless now (because at that point the
mapcount hasn't actually been decremented yet). I didn't remove it,
because I wasn't sure which option was best:

(a) just remove it entirely

(b) change the "< 0" to "<= 0"

(c) move it to clean_and_free_pages_and_swap_cache(), which actually
does the page_zap_pte_rmap() now.

so I'm in no way claiming this series is any kind of final word, but I
do mostly like how the code ended up looking.

Now, this whole "remove rmap after flush" would probably make sense
for some of the largepage cases too, and there's room for another bit
in there (or more, if you align 'struct page') and that whole code
could use page size hints etc too. But I suspect that the main case we
really care about is individual small pages, just because that's the
case where I think it would be much worse to do any fancy TLB
tracking. The largepage cases presumably aren't as critical, since
there are by definition fewer of those.

Comments?

Linus
From aeea35b14fa697ab4e5aabc03915d954cdbedaf8 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Sun, 30 Oct 2022 13:26:07 -0700
Subject: [PATCH 1/4] mm: introduce simplified versions of 'page_remove_rmap()'

The rmap handling is proving a bit problematic, and part of it comes
from the complexities of all the different cases of our implementation
of 'page_remove_rmap()'.

And a large part of that complexity comes from the fact that while we
have multiple different versions of _adding_ an rmap, this 'remove rmap'
function tries to deal with all possible cases.

So we have these specific versions for page_add_anon_rmap(),
page_add_new_anon_rmap() and page_add_file_rmap() which all do slightly
different things, but then 'page_remove_rmap()' has to handle all the
cases.

That's particularly annoying for 'zap_pte_range()', which already knows
which special case it's dealing with. It already checked for its own
reasons whether it's an anonymous page, and it already knows it's not
the compound page case and passed in an unconditional 'false' argument.

So this introduces the specialized versions of 'page_remove_rmap()' for
the cases that zap_pte_range() wants. We also make it the job of the
caller to do the munlock_vma_page(), which is really unrelated and is
the only thing that cares about the 'vma'.

This just means that we end up with several simplifications:

- there's no 'vma' argument any more, because it's not used

- there's no 'compound' argument any more, because it was always false

- we can get rid of the tests for 'compound' and 'PageAnon()', since we
already know the answers to them

and so instead of having that fairly complicated page_remove_rmap()
function, we end up with a couple of specialized functions that are
_much_ simpler.

There is supposed to be no semantic difference from this change,
although this does end up simplifying the code further by moving the
atomic_add_negative() on the PageAnon mapcount to outside the memcg
locking.

That locking protects other data structures (the page state statistics),
and this not only avoids an ugly 'goto', but also means that we don't
need to take and release the lock when we're not actually doing
anything with the state statistics.

We also remove the test for PageTransCompound(), since this is only
called for the final pte level from zap_pte_range().

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
include/linux/rmap.h | 2 ++
mm/memory.c | 6 ++++--
mm/rmap.c | 42 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..8d29b7c38368 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -196,6 +196,8 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long address);
void page_add_file_rmap(struct page *, struct vm_area_struct *,
bool compound);
+void page_zap_file_rmap(struct page *);
+void page_zap_anon_rmap(struct page *);
void page_remove_rmap(struct page *, struct vm_area_struct *,
bool compound);

diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..ba1d08a908a4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1451,9 +1451,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (pte_young(ptent) &&
likely(!(vma->vm_flags & VM_SEQ_READ)))
mark_page_accessed(page);
- }
+ page_zap_file_rmap(page);
+ } else
+ page_zap_anon_rmap(page);
+ munlock_vma_page(page, vma, false);
rss[mm_counter(page)]--;
- page_remove_rmap(page, vma, false);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
if (unlikely(__tlb_remove_page(tlb, page))) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ec925e5fa6a..71a5365f23f3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1412,6 +1412,48 @@ static void page_remove_anon_compound_rmap(struct page *page)
__mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
}

+/**
+ * page_zap_file_rmap - take down non-anon pte mapping from a page
+ * @page: page to remove mapping from
+ *
+ * This is the simplified form of page_remove_rmap(), with:
+ * - we've already checked for '!PageAnon(page)'
+ * - 'compound' is always false
+ * - the caller does 'munlock_vma_page(page, vma, compound)' separately
+ * which allows for a much simpler calling convention.
+ *
+ * The caller holds the pte lock.
+ */
+void page_zap_file_rmap(struct page *page)
+{
+ lock_page_memcg(page);
+ page_remove_file_rmap(page, false);
+ unlock_page_memcg(page);
+}
+
+/**
+ * page_zap_anon_rmap - take down anon pte mapping from a page
+ * @page: page to remove mapping from
+ *
+ * This is the simplified form of page_remove_rmap(), with:
+ * - we've already checked for 'PageAnon(page)'
+ * - 'compound' is always false
+ * - the caller does 'munlock_vma_page(page, vma, compound)' separately
+ * which allows for a much simpler calling convention.
+ *
+ * The caller holds the pte lock.
+ */
+void page_zap_anon_rmap(struct page *page)
+{
+ /* page still mapped by someone else? */
+ if (!atomic_add_negative(-1, &page->_mapcount))
+ return;
+
+ lock_page_memcg(page);
+ __dec_lruvec_page_state(page, NR_ANON_MAPPED);
+ unlock_page_memcg(page);
+}
+
/**
* page_remove_rmap - take down pte mapping from a page
* @page: page to remove mapping from
--
2.37.1.289.g45aa1e5c72.dirty

From 79c23c212f9e21edb2dbb440dd499d0a49e79bea Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Sun, 30 Oct 2022 13:50:39 -0700
Subject: [PATCH 2/4] mm: inline simpler case of page_remove_file_rmap()

Now that we have a simplified special case of 'page_remove_rmap()' that
doesn't deal with the 'compound' case and always gets a file-mapped (ie
not anonymous) page, it ended up doing just

lock_page_memcg(page);
page_remove_file_rmap(page, false);
unlock_page_memcg(page);

but 'page_remove_file_rmap()' is actually trivial when 'compound' is false.

So just inline that non-compound case in the caller, and - like we did
in the previous commit for the anon pages - only do the memcg locking for
the parts that actually matter: the page statistics.

Also, as the previous commit did for anonymous pages, knowing we only
get called for the last-level page table entries allows for a further
simplification: we can get rid of the 'PageHuge(page)' case too.

You can't map a huge-page in a pte without splitting it (and the full
code in the generic page_remove_file_rmap() function has a comment to
that effect: "hugetlb pages are always mapped with pmds").

That means that the page_zap_file_rmap() case of that whole function is
really small and trivial.

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
mm/rmap.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 71a5365f23f3..69de6c833d5c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1426,8 +1426,11 @@ static void page_remove_anon_compound_rmap(struct page *page)
*/
void page_zap_file_rmap(struct page *page)
{
+ if (!atomic_add_negative(-1, &page->_mapcount))
+ return;
+
lock_page_memcg(page);
- page_remove_file_rmap(page, false);
+ __dec_lruvec_page_state(page, NR_FILE_MAPPED);
unlock_page_memcg(page);
}

--
2.37.1.289.g45aa1e5c72.dirty

From 25d9e6a9b37e573390af2e3f6c1db429d8ddb4ad Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Sun, 30 Oct 2022 15:14:43 -0700
Subject: [PATCH 3/4] mm: re-unify the simplified page_zap_*_rmap() function

Now that we've simplified both the anonymous and file-backed page zap
functions, they end up being identical except for which page statistic
they update, and we can re-unify the implementation of that much
simplified code.

To make it very clear that this is only for the final pte zapping (since
a lot of the simplifications depended on that), name the unified
function 'page_zap_pte_rmap()'.

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
include/linux/rmap.h | 3 +--
mm/memory.c | 5 ++---
mm/rmap.c | 39 +++++++++------------------------------
3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8d29b7c38368..f62af001707c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -196,8 +196,7 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long address);
void page_add_file_rmap(struct page *, struct vm_area_struct *,
bool compound);
-void page_zap_file_rmap(struct page *);
-void page_zap_anon_rmap(struct page *);
+void page_zap_pte_rmap(struct page *);
void page_remove_rmap(struct page *, struct vm_area_struct *,
bool compound);

diff --git a/mm/memory.c b/mm/memory.c
index ba1d08a908a4..c893f5ffc5a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1451,9 +1451,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (pte_young(ptent) &&
likely(!(vma->vm_flags & VM_SEQ_READ)))
mark_page_accessed(page);
- page_zap_file_rmap(page);
- } else
- page_zap_anon_rmap(page);
+ }
+ page_zap_pte_rmap(page);
munlock_vma_page(page, vma, false);
rss[mm_counter(page)]--;
if (unlikely(page_mapcount(page) < 0))
diff --git a/mm/rmap.c b/mm/rmap.c
index 69de6c833d5c..28b51a31ebb0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1413,47 +1413,26 @@ static void page_remove_anon_compound_rmap(struct page *page)
}

/**
- * page_zap_file_rmap - take down non-anon pte mapping from a page
+ * page_zap_pte_rmap - take down a pte mapping from a page
* @page: page to remove mapping from
*
- * This is the simplified form of page_remove_rmap(), with:
- * - we've already checked for '!PageAnon(page)'
- * - 'compound' is always false
- * - the caller does 'munlock_vma_page(page, vma, compound)' separately
- * which allows for a much simpler calling convention.
+ * This is the simplified form of page_remove_rmap(), which only
+ * deals with last-level pages, so 'compound' is always false,
+ * and the caller does 'munlock_vma_page(page, vma, compound)'
+ * separately.
*
- * The caller holds the pte lock.
- */
-void page_zap_file_rmap(struct page *page)
-{
- if (!atomic_add_negative(-1, &page->_mapcount))
- return;
-
- lock_page_memcg(page);
- __dec_lruvec_page_state(page, NR_FILE_MAPPED);
- unlock_page_memcg(page);
-}
-
-/**
- * page_zap_anon_rmap - take down anon pte mapping from a page
- * @page: page to remove mapping from
- *
- * This is the simplified form of page_remove_rmap(), with:
- * - we've already checked for 'PageAnon(page)'
- * - 'compound' is always false
- * - the caller does 'munlock_vma_page(page, vma, compound)' separately
- * which allows for a much simpler calling convention.
+ * This allows for a much simpler calling convention and code.
*
* The caller holds the pte lock.
*/
-void page_zap_anon_rmap(struct page *page)
+void page_zap_pte_rmap(struct page *page)
{
- /* page still mapped by someone else? */
if (!atomic_add_negative(-1, &page->_mapcount))
return;

lock_page_memcg(page);
- __dec_lruvec_page_state(page, NR_ANON_MAPPED);
+ __dec_lruvec_page_state(page,
+ PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
unlock_page_memcg(page);
}

--
2.37.1.289.g45aa1e5c72.dirty

From 552d121375f88ba4db55460cd378c9150f994945 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Sat, 29 Oct 2022 11:45:07 -0700
Subject: [PATCH 4/4] mm: delay rmap removal until after TLB flush

When we remove a page table entry, we are very careful to only free the
page after we have flushed the TLB, because other CPUs could still be
using the page through stale TLB entries until after the flush.

However, we have removed the rmap entry for that page early, which means
that functions like folio_mkclean() would end up not serializing with
the page table lock because the page had already been made invisible to
rmap.

And that is a problem, because while the TLB entry exists, we could end
up with the following situation:

(a) one CPU could come in and clean it, never seeing our mapping of
the page

(b) another CPU could continue to use the stale and dirty TLB entry
and continue to write to said page

resulting in a page that has been dirtied, but then marked clean again,
all while another CPU might have dirtied it some more.

End result: possibly lost dirty data.

This commit uses the same old TLB gather array that we use to delay the
freeing of the page to also say 'remove from rmap after flush', so that
we can keep the rmap entries alive until all TLB entries have been
flushed.

NOTE! While the "possibly lost dirty data" sounds catastrophic, for this
all to happen you need to have a user thread doing either madvise() with
MADV_DONTNEED or a full re-mmap() of the area concurrently with another
thread continuing to use said mapping.

So arguably this is about user space doing crazy things, but from a VM
consistency standpoint it's better if we track the dirty bit properly
even then.

Reported-by: Nadav Amit <nadav.amit@xxxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
include/asm-generic/tlb.h | 36 +++++++++++++++++++++++++++++-----
mm/memory.c | 3 +--
mm/mmu_gather.c | 41 +++++++++++++++++++++++++++++++++++----
mm/rmap.c | 4 +---
4 files changed, 70 insertions(+), 14 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 492dce43236e..a5c9c9989fd2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -238,11 +238,37 @@ extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
*/
#define MMU_GATHER_BUNDLE 8

+/*
+ * Fake type for an encoded page with flag bits in the low bits.
+ *
+ * Right now just one bit, but we could have more depending on the
+ * alignment of 'struct page'.
+ */
+struct encoded_page;
+#define TLB_ZAP_RMAP 1ul
+#define ENCODE_PAGE_BITS (TLB_ZAP_RMAP)
+
+static inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
+{
+ flags &= ENCODE_PAGE_BITS;
+ return (struct encoded_page *)(flags | (unsigned long)page);
+}
+
+static inline bool encoded_page_flags(struct encoded_page *page)
+{
+ return ENCODE_PAGE_BITS & (unsigned long)page;
+}
+
+static inline struct page *encoded_page_ptr(struct encoded_page *page)
+{
+ return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+}
+
struct mmu_gather_batch {
struct mmu_gather_batch *next;
unsigned int nr;
unsigned int max;
- struct page *pages[];
+ struct encoded_page *encoded_pages[];
};

#define MAX_GATHER_BATCH \
@@ -257,7 +283,7 @@ struct mmu_gather_batch {
#define MAX_GATHER_BATCH_COUNT (10000UL/MAX_GATHER_BATCH)

extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
- int page_size);
+ int page_size, unsigned int flags);
#endif

/*
@@ -431,13 +457,13 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
static inline void tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, int page_size)
{
- if (__tlb_remove_page_size(tlb, page, page_size))
+ if (__tlb_remove_page_size(tlb, page, page_size, 0))
tlb_flush_mmu(tlb);
}

-static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+static inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags)
{
- return __tlb_remove_page_size(tlb, page, PAGE_SIZE);
+ return __tlb_remove_page_size(tlb, page, PAGE_SIZE, flags);
}

/* tlb_remove_page
diff --git a/mm/memory.c b/mm/memory.c
index c893f5ffc5a8..230946536115 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1452,12 +1452,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
likely(!(vma->vm_flags & VM_SEQ_READ)))
mark_page_accessed(page);
}
- page_zap_pte_rmap(page);
munlock_vma_page(page, vma, false);
rss[mm_counter(page)]--;
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
- if (unlikely(__tlb_remove_page(tlb, page))) {
+ if (unlikely(__tlb_remove_page(tlb, page, TLB_ZAP_RMAP))) {
force_flush = 1;
addr += PAGE_SIZE;
break;
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index add4244e5790..587873e5984c 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -9,6 +9,7 @@
#include <linux/rcupdate.h>
#include <linux/smp.h>
#include <linux/swap.h>
+#include <linux/rmap.h>

#include <asm/pgalloc.h>
#include <asm/tlb.h>
@@ -43,12 +44,43 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
return true;
}

+/*
+ * We get an 'encoded page' array, which has page pointers with
+ * encoded flags in the low bits of the array.
+ *
+ * The TLB has been flushed, now we need to react to the flag bits in
+ * the 'struct page' pointers, clean the array in-place, and then free the
+ * pages and their swap cache.
+ */
+static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
+{
+ for (unsigned int i = 0; i < nr; i++) {
+ struct encoded_page *encoded = pages[i];
+ unsigned int flags = encoded_page_flags(encoded);
+ if (flags) {
+ /* Clean the flagged pointer in-place */
+ struct page *page = encoded_page_ptr(encoded);
+ pages[i] = encode_page(page, 0);
+
+ /* The flag bit being set means that we should zap the rmap */
+ page_zap_pte_rmap(page);
+ }
+ }
+
+ /*
+ * Now all entries have been un-encoded, and changed to plain
+ * page pointers, so we can cast the 'encoded_page' array to
+ * a plain page array and free them
+ */
+ free_pages_and_swap_cache((struct page **)pages, nr);
+}
+
static void tlb_batch_pages_flush(struct mmu_gather *tlb)
{
struct mmu_gather_batch *batch;

for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
- struct page **pages = batch->pages;
+ struct encoded_page **pages = batch->encoded_pages;

do {
/*
@@ -56,7 +88,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
*/
unsigned int nr = min(512U, batch->nr);

- free_pages_and_swap_cache(pages, nr);
+ clean_and_free_pages_and_swap_cache(pages, nr);
pages += nr;
batch->nr -= nr;

@@ -77,11 +109,12 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
tlb->local.next = NULL;
}

-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size, unsigned int flags)
{
struct mmu_gather_batch *batch;

VM_BUG_ON(!tlb->end);
+ VM_BUG_ON(flags & ~ENCODE_PAGE_BITS);

#ifdef CONFIG_MMU_GATHER_PAGE_SIZE
VM_WARN_ON(tlb->page_size != page_size);
@@ -92,7 +125,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
* Add the page and check if we are full. If so
* force a flush.
*/
- batch->pages[batch->nr++] = page;
+ batch->encoded_pages[batch->nr++] = encode_page(page, flags);
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
diff --git a/mm/rmap.c b/mm/rmap.c
index 28b51a31ebb0..416b7078b75f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1422,8 +1422,6 @@ static void page_remove_anon_compound_rmap(struct page *page)
* separately.
*
* This allows for a much simpler calling convention and code.
- *
- * The caller holds the pte lock.
*/
void page_zap_pte_rmap(struct page *page)
{
@@ -1431,7 +1429,7 @@ void page_zap_pte_rmap(struct page *page)
return;

lock_page_memcg(page);
- __dec_lruvec_page_state(page,
+ dec_lruvec_page_state(page,
PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
unlock_page_memcg(page);
}
--
2.37.1.289.g45aa1e5c72.dirty