Re: [syzbot] [mm?] [fs?] BUG: sleeping function called from invalid context in folio_mc_copy

From: Luis Chamberlain
Date: Sat Mar 29 2025 - 22:27:01 EST


On Sat, Mar 29, 2025 at 10:05:34PM -0400, Rik van Riel wrote:
> On Thu, 2025-03-27 at 14:42 -0700, Luis Chamberlain wrote:
> > On Thu, Mar 27, 2025 at 09:26:41AM -0700, syzbot wrote:
> > > Hello,
> >
> > Thanks, this is a known issue and we're having a hard time
> > reproducing [0].
> >
> > > C reproducer:  
> > > https://syzkaller.appspot.com/x/repro.c?x=152d4de4580000
> >
> > Thanks! Sadly this has not yet let me reproduce the issue, and so
> > we're now trying other ways to test the imminent spin lock + sleep
> > on the buffer_migrate_folio_norefs() path, including a new fstests
> > test [1], but no luck yet.
>
> The backtrace in the report seems to make the cause
> of the bug fairly clear, though.
>
> The function folio_mc_copy() can sleep.
>
> The function __buffer_migrate_folio() calls
> filemap_migrate_folio() with a spinlock held.
>
> That function eventually calls folio_mc_copy():
>
> __might_resched+0x5d4/0x780 kernel/sched/core.c:8764
> folio_mc_copy+0x13c/0x1d0 mm/util.c:742
> __migrate_folio mm/migrate.c:758 [inline]
> filemap_migrate_folio+0xb4/0x4c0 mm/migrate.c:943
> __buffer_migrate_folio+0x3ec/0x5d0 mm/migrate.c:874
> move_to_new_folio+0x2ac/0xc20 mm/migrate.c:1050
> migrate_folio_move mm/migrate.c:1358 [inline]
> migrate_folios_move mm/migrate.c:1710 [inline]
>
> The big question is how to safely release the
> spinlock in __buffer_migrate_folio() before calling
> filemap_migrate_folio()
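
To make the pattern in the quoted backtrace concrete, here is a minimal
userspace model of it. This is only a sketch under stated assumptions: it
is not kernel code and not the proposed fix. Every name in it
(model_spin_lock(), model_might_sleep(), model_copy() and the two
migrate_copy_*() helpers) is hypothetical and only mimics the roles of a
spinlock, might_sleep() and folio_mc_copy().

```c
/* Userspace model only: none of these functions are kernel APIs. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int atomic_depth;     /* stands in for "a spinlock is held" */
static int sleep_violations; /* counts "sleeping in invalid context" hits */

static void model_spin_lock(void)
{
	atomic_depth++;
}

static void model_spin_unlock(void)
{
	assert(atomic_depth > 0); /* unlock must pair with a lock */
	atomic_depth--;
}

/* Plays the role of might_sleep(): complain if called while atomic. */
static void model_might_sleep(void)
{
	if (atomic_depth > 0)
		sleep_violations++;
}

/* Plays the role of folio_mc_copy(): a copy that is allowed to sleep. */
static void model_copy(char *dst, const char *src, size_t len)
{
	model_might_sleep();
	memcpy(dst, src, len);
}

/* Pattern from the backtrace: the sleeping copy runs with the lock held. */
static int migrate_copy_under_lock(char *dst, const char *src, size_t len)
{
	sleep_violations = 0;
	model_spin_lock();
	/* checks that need the lock would go here */
	model_copy(dst, src, len);
	model_spin_unlock();
	return sleep_violations;
}

/* One possible shape: finish the locked checks, unlock, then copy. */
static int migrate_copy_after_unlock(char *dst, const char *src, size_t len)
{
	sleep_violations = 0;
	model_spin_lock();
	/* checks that need the lock would go here */
	model_spin_unlock();
	model_copy(dst, src, len);
	return sleep_violations;
}
```

In this model the first helper records one violation and the second
records none. It says nothing about whether dropping the lock early is
safe in the real __buffer_migrate_folio(), which is exactly the open
question above.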

I suggested a way in the other 0-day bug report, as that was the thread
that started this investigation [0]. That has survived 20 hours of ext4
with generic/750 and the newly proposed generic/764 [1], while also
using a block device with large folios and running dd against it in a
loop.

And so now I'm going to establish an ext4 baseline with kdevops on all
ext4 profiles on linux-next, and then check to see if there are any
regressions with it.

I've also localized the new check to only those that need it.

[0] https://lkml.kernel.org/r/Z-dHqMtGneCVs3v5@xxxxxxxxxxxxxxxxxxxxxx
[1] https://lkml.kernel.org/r/20250326185101.2237319-1-mcgrof@xxxxxxxxxx

Anyway, below are the latest changes: