Re: [REGRESSION][BISECTED] Crash with Bad page state for FUSE/Flatpak related applications since v6.13

From: Josef Bacik
Date: Fri Feb 07 2025 - 12:29:31 EST


On Fri, Feb 07, 2025 at 05:49:34PM +0100, Vlastimil Babka wrote:
> On 2/7/25 10:34, Miklos Szeredi wrote:
> > [Adding Joanne, Willy and linux-mm].
> >
> >
> > On Thu, 6 Feb 2025 at 11:54, Christian Heusel <christian@xxxxxxxxx> wrote:
> >>
> >> Hello everyone,
> >>
> >> we have recently received [a report][0] on the Arch Linux Gitlab about
> >> multiple users having system crashes when using Flatpak programs and
> >> related FUSE errors in their dmesg logs.
> >>
> >> We have subsequently bisected the issue within the mainline kernel tree
> >> to the following commit:
> >>
> >> 3eab9d7bc2f4 ("fuse: convert readahead to use folios")
>
> I see that commit removes folio_put() from fuse_readpages_end(). Also it now
> uses readahead_folio() in fuse_readahead() which does folio_put(). So that's
> suspicious to me. It might be storing pointers to pages to ap->pages without
> pinning them with a refcount.
>
> But I don't understand the code enough to know what's the proper fix. A
> probably stupid fix would be to use __readahead_folio() instead and keep the
> folio_put() in fuse_readpages_end().

Agreed, I'm also confused as to what the right thing is here. It appears the
rules are "if the folio is locked, nobody messes with it", so it's not "correct"
to hold a reference on the folio while it's being read. I don't love this way
of dealing with folios, but that seems to be the way it's always worked.

I went and looked at a few of the other file systems. NFS holds its own
reference to the folio while the IO is outstanding, and since FUSE is most
similar to NFS, doing the same here would make sense.
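
To make the refcounting question concrete, here is a userspace toy model, not
kernel code. It assumes the folio starts with two references (one for the page
cache, one taken for readahead), that readahead_folio() drops the second one
while __readahead_folio() leaves it with the caller, and that some hypothetical
path drops the page cache reference while the read is still outstanding. All
toy_* names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct folio; not the kernel type. */
struct toy_folio {
	int refcount;
	bool freed;
};

static void toy_folio_put(struct toy_folio *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		f->freed = true;
}

/* Models __readahead_folio(): the caller keeps the readahead reference. */
static struct toy_folio *toy_readahead_keep_ref(struct toy_folio *f)
{
	return f;
}

/* Models readahead_folio(): same, but the readahead reference is dropped. */
static struct toy_folio *toy_readahead_drop_ref(struct toy_folio *f)
{
	toy_folio_put(f);
	return f;
}

/* Returns true if the folio was freed while the read was still outstanding. */
static bool toy_freed_during_io(bool io_holds_ref)
{
	/* One reference for the page cache, one taken for readahead. */
	struct toy_folio folio = { .refcount = 2, .freed = false };
	struct toy_folio *f;
	bool freed_early;

	f = io_holds_ref ? toy_readahead_keep_ref(&folio)
			 : toy_readahead_drop_ref(&folio);

	/* Hypothetical: some path drops the page cache reference mid-read. */
	toy_folio_put(f);
	freed_early = f->freed;

	/* Completion side: drop the I/O reference only if we kept one. */
	if (io_holds_ref)
		toy_folio_put(f);

	return freed_early;
}
```

In this model toy_freed_during_io(false) reports an early free and
toy_freed_during_io(true) does not, which is the difference the NFS-style
reference would make. Whether anything can actually drop the page cache
reference on a locked folio is exactly the open question.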

Btrfs, however, doesn't do this, BUT we do set_folio_private (or whatever it's
called), which keeps the folio from being reclaimed, since we'll try to lock
the folio before we do the reclaim.

So perhaps that's the issue here? We need to have a private on the folio + the
folio locked to make sure it doesn't get reclaimed while it's out being read?

I'm knee deep in other things. If we want a quick fix, then I think your
suggestion is correct, Vlastimil. But I definitely want to know what Willy
expects to be the proper order of operations here. If this is exactly what
we're supposed to be doing, then something else is going wrong, and we should
try to reproduce locally and figure out what's happening. Thanks,

Josef