Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)
From: Matthew Wilcox
Date: Fri Sep 13 2024 - 17:30:34 EST

On Fri, Sep 13, 2024 at 02:24:02PM -0700, Linus Torvalds wrote:
> On Fri, 13 Sept 2024 at 11:15, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > Oh! I think split is the key. Let's say we have an order-6 (or
> > larger) folio. And we call split_huge_page() (whatever it's called
> > in your kernel version). That calls xas_split_alloc() followed
> > by xas_split(). xas_split_alloc() puts entry in node->slots[0] and
> > initialises node->slots[1..XA_CHUNK_SIZE] to a sibling entry.
>
> Hmm. The splitting does seem to be not just indicated by the debug
> logs, but it ends up being a fairly complicated case. *The* most
> complicated case of adding a new folio by far, I'd say.
>
> And I wonder if it's even necessary?

Unfortunately, we need to handle things like "we are truncating a file
which has a folio which now extends many pages beyond the end of the
file" and so we have to split the folio which now crosses EOF. Or we
could write it back and drop it, but that has its own problems.
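
To make the truncation case concrete, here is a rough userspace sketch
(not kernel code). It assumes 4 KiB pages, and the helper names
pages_beyond_eof() and folio_crosses_eof() are hypothetical. It only
counts how many pages of a folio lie wholly past the new end of file,
which is the part a split would let us free:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/*
 * Hypothetical helper: number of pages in a folio of the given order,
 * starting at page index folio_index, that lie wholly beyond a file
 * of i_size bytes.
 */
static uint64_t pages_beyond_eof(uint64_t folio_index, unsigned int order,
				 uint64_t i_size)
{
	uint64_t nr = 1ULL << order;
	uint64_t end_index = folio_index + nr;	/* one past the last page */
	/* First page index that holds no file data at all. */
	uint64_t eof_index = (i_size + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;

	if (eof_index <= folio_index)
		return nr;		/* whole folio is past EOF */
	if (eof_index >= end_index)
		return 0;		/* whole folio is inside the file */
	return end_index - eof_index;
}

/* True when the folio holds some file data and some pages past EOF. */
static bool folio_crosses_eof(uint64_t folio_index, unsigned int order,
			      uint64_t i_size)
{
	uint64_t beyond = pages_beyond_eof(folio_index, order, i_size);

	return beyond > 0 && beyond < (1ULL << order);
}
```

With an order-6 folio at index 0 and a file truncated to 5000 bytes,
62 of the 64 pages sit past EOF, which is the memory that stays pinned
if the folio is neither split nor dropped.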

Part of the "large block size" patches sitting in Christian's tree is
solving these problems for folios which can't be split down to order-0,
so there may be ways we can handle this better now, but if we don't
split we might end up wasting a lot of memory in file tails.

> It's possible that I'm entirely missing something, but at least the
> filemap_add_folio() case looks like it really would actually be
> happier with a "oh, that size conflicts with an existing entry, let's
> just allocate a smaller size then"

Pretty sure we already do that; it's mostly handled through the
readahead path which checks for conflicting folios already in the cache.
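
The "fall back to a smaller size on conflict" idea can be sketched in
userspace as below. This is an illustration only, not the readahead
code: the page cache is modelled as a plain array of booleans, and
range_is_free() and pick_order() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the page cache: occupied[i] is true when
 * page index i is already covered by a folio.
 */
static bool range_is_free(const bool *occupied, size_t npages,
			  size_t start, size_t nr)
{
	if (start + nr > npages)
		return false;
	for (size_t i = start; i < start + nr; i++)
		if (occupied[i])
			return false;
	return true;
}

/*
 * Return the largest order <= max_order whose naturally aligned range
 * containing index has no conflicting entry, or -1 if index itself is
 * already occupied.
 */
static int pick_order(const bool *occupied, size_t npages,
		      size_t index, int max_order)
{
	for (int order = max_order; order >= 0; order--) {
		size_t nr = (size_t)1 << order;
		size_t start = index & ~(nr - 1);	/* align down */

		if (range_is_free(occupied, npages, start, nr))
			return order;
	}
	return -1;
}
```

If only page 5 of a 64-page range is occupied, a request at index 0
with max_order 6 steps down to order 2 (pages 0-3), while a request at
index 32 can still use order 5.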