Re: [PATCH v1] fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages
From: Dan Williams
Date: Thu Apr 10 2025 - 16:15:55 EST
David Hildenbrand wrote:
> Alison reports an issue with fsdax when large extents end up using
> large ZONE_DEVICE folios:
>
> [ 417.796271] BUG: kernel NULL pointer dereference, address: 0000000000000b00
> [ 417.796982] #PF: supervisor read access in kernel mode
> [ 417.797540] #PF: error_code(0x0000) - not-present page
> [ 417.798123] PGD 2a5c5067 P4D 2a5c5067 PUD 2a5c6067 PMD 0
> [ 417.798690] Oops: Oops: 0000 [#1] SMP NOPTI
> [ 417.799178] CPU: 5 UID: 0 PID: 1515 Comm: mmap Tainted: ...
> [ 417.800150] Tainted: [O]=OOT_MODULE
> [ 417.800583] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [ 417.801358] RIP: 0010:__lruvec_stat_mod_folio+0x7e/0x250
> [ 417.801948] Code: ...
> [ 417.803662] RSP: 0000:ffffc90002be3a08 EFLAGS: 00010206
> [ 417.804234] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
> [ 417.804984] RDX: ffffffff815652d7 RSI: 0000000000000000 RDI: ffffffff82a2beae
> [ 417.805689] RBP: ffffc90002be3a28 R08: 0000000000000000 R09: 0000000000000000
> [ 417.806384] R10: ffffea0007000040 R11: ffff888376ffe000 R12: 0000000000000001
> [ 417.807099] R13: 0000000000000012 R14: ffff88807fe4ab40 R15: ffff888029210580
> [ 417.807801] FS: 00007f339fa7a740(0000) GS:ffff8881fa9b9000(0000) knlGS:0000000000000000
> [ 417.808570] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 417.809193] CR2: 0000000000000b00 CR3: 000000002a4f0004 CR4: 0000000000370ef0
> [ 417.809925] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 417.810622] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 417.811353] Call Trace:
> [ 417.811709] <TASK>
> [ 417.812038] folio_add_file_rmap_ptes+0x143/0x230
> [ 417.812566] insert_page_into_pte_locked+0x1ee/0x3c0
> [ 417.813132] insert_page+0x78/0xf0
> [ 417.813558] vmf_insert_page_mkwrite+0x55/0xa0
> [ 417.814088] dax_fault_iter+0x484/0x7b0
> [ 417.814542] dax_iomap_pte_fault+0x1ca/0x620
> [ 417.815055] dax_iomap_fault+0x39/0x40
> [ 417.815499] __xfs_write_fault+0x139/0x380
> [ 417.815995] ? __handle_mm_fault+0x5e5/0x1a60
> [ 417.816483] xfs_write_fault+0x41/0x50
> [ 417.816966] xfs_filemap_fault+0x3b/0xe0
> [ 417.817424] __do_fault+0x31/0x180
> [ 417.817859] __handle_mm_fault+0xee1/0x1a60
> [ 417.818325] ? debug_smp_processor_id+0x17/0x20
> [ 417.818844] handle_mm_fault+0xe1/0x2b0
> [...]
>
> The issue is that when we split a large ZONE_DEVICE folio to order-0
> ones, we don't reset the order/_nr_pages. As folio->_nr_pages overlays
> page[1]->memcg_data, once page[1] is a folio, it suddenly looks like it
> has folio->memcg_data set. And we never manually initialize
> folio->memcg_data in fsdax code, because we never expect it to be set at
> all.
>
> When __lruvec_stat_mod_folio() then stumbles over such a folio, it tries to
> use folio->memcg_data (because it's non-NULL) but it does not actually
> point at a memcg, resulting in the problem.
>
> Alison also observed that these folios sometimes have "locked"
> set, which is rather concerning (folios locked from the beginning ...).
> The reason is that the order for large folios is stored in page[1]->flags,
> which becomes the folio->flags of a new small folio.
>
> Let's fix it by adding a folio helper to clear order/_nr_pages for
> splitting purposes.
>
> Maybe we should reinitialize other large folio flags / folio members as
> well when splitting, because they might similarly cause harm once
> page[1] becomes a folio? At least other flags in PAGE_FLAGS_SECOND should
> not be set for fsdax, so at least page[1]->flags might be as expected with
> this fix.
>
> From a quick glimpse, initializing ->mapping, ->pgmap and ->share should
> re-initialize most things from a previous page[1] used by large folios
> that fsdax cares about. For example folio->private might not get
> reinitialized, but maybe that's not relevant -- no traces of its use in
> fsdax code. Needs a closer look.
>
> Another thing that should be considered in the future is performing similar
> checks as we perform in free_tail_page_prepare() -- checking pincount etc.
> -- when freeing a large fsdax folio.
>
> Fixes: 4996fc547f5b ("mm: let _folio_nr_pages overlay memcg_data in first tail page")
> Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages")
> Reported-by: Alison Schofield <alison.schofield@xxxxxxxxx>
> Closes: https://lkml.kernel.org/r/Z_W9Oeg-D9FhImf3@xxxxxxxxxxxxxxxxxx
> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
> Cc: Christian Brauner <brauner@xxxxxxxxxx>
> Cc: Jan Kara <jack@xxxxxxx>
> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Alistair Popple <apopple@xxxxxxxxxx>
> Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> ---
> fs/dax.c | 1 +
> include/linux/mm.h | 17 +++++++++++++++++
> 2 files changed, 18 insertions(+)

The explanation is excellent; folio_reset_order() and its callsite in
fsdax both look correct to me.

Reviewed-by: Dan Williams <dan.j.williams@xxxxxxxxx>

For consistency and clarity, what about this incremental change? It makes
the __split_folio_to_order() path reuse folio_reset_order() and uses the
typical bitfield helpers for manipulating _flags_1:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf55206935c4..5b614d31f4f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -33,6 +33,7 @@
#include <linux/slab.h>
#include <linux/cacheinfo.h>
#include <linux/rcuwait.h>
+#include <linux/bitfield.h>
struct mempolicy;
struct anon_vma;
@@ -1171,7 +1172,7 @@ extern void prep_compound_page(struct page *page, unsigned int order);
static inline unsigned int folio_large_order(const struct folio *folio)
{
- return folio->_flags_1 & 0xff;
+ return FIELD_GET(FOLIO_ORDER_MASK, folio->_flags_1);
}
#ifdef NR_PAGES_IN_LARGE_FOLIO
@@ -1229,7 +1230,8 @@ static inline void folio_reset_order(struct folio *folio)
{
if (WARN_ON_ONCE(!folio_test_large(folio)))
return;
- folio->_flags_1 &= ~0xffUL;
+ ClearPageCompound(&folio->page);
+ folio->_flags_1 &= ~FOLIO_ORDER_MASK;
#ifdef NR_PAGES_IN_LARGE_FOLIO
folio->_nr_pages = 0;
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..3dc2d98fde24 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -483,6 +483,8 @@ struct folio {
};
};
+#define FOLIO_ORDER_MASK GENMASK(7, 0)
+
#define FOLIO_MATCH(pg, fl) \
static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
FOLIO_MATCH(flags, flags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab7..301ca9459122 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3404,7 +3404,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
if (new_order)
folio_set_order(folio, new_order);
else
- ClearPageCompound(&folio->page);
+ folio_reset_order(folio);
}
/*
diff --git a/mm/internal.h b/mm/internal.h
index 50c2f590b2d0..41a4d2b66405 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -727,7 +727,8 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
if (WARN_ON_ONCE(!order || !folio_test_large(folio)))
return;
- folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;
+ folio->_flags_1 &= ~FOLIO_ORDER_MASK;
+ folio->_flags_1 |= FIELD_PREP(FOLIO_ORDER_MASK, order);
#ifdef NR_PAGES_IN_LARGE_FOLIO
folio->_nr_pages = 1U << order;
#endif