Re: [REGRESSION] kswapd0: page allocation failure (bisected to "slab: add sheaves to most caches")

From: David Sterba

Date: Mon Feb 23 2026 - 15:33:59 EST

On Mon, Feb 23, 2026 at 08:59:30PM +0900, Harry Yoo wrote:
> On Mon, Feb 23, 2026 at 11:12:47AM +0000, Chris Bainbridge wrote:
> > On Mon, Feb 23, 2026 at 05:41:17PM +0900, Harry Yoo wrote:
> > > On Sun, Feb 22, 2026 at 09:36:58PM +0000, Chris Bainbridge wrote:
> > > > Hi,
> > > >
> > > > The latest mainline kernel (v6.19-11831-ga95f71ad3e2e) has page
> > > > allocation failures when doing things like compiling a kernel. I can
> > > > also reproduce this with a stress test like
> > > > `stress-ng --vm 2 --vm-bytes 110% --verify -v`
> > >
> > > Hi, thanks for the report!
> > >
> > > > [ 104.032925] kswapd0: page allocation failure: order:0, mode:0xc0c40(GFP_NOFS|__GFP_COMP|__GFP_NOMEMALLOC), nodemask=(null),cpuset=/,mems_allowed=0
> > > > [ 104.033307] CPU: 4 UID: 0 PID: 156 Comm: kswapd0 Not tainted 6.19.0-rc5-00027-g40fd0acc45d0 #435 PREEMPT(voluntary)
> > > > [ 104.033312] Hardware name: HP HP Pavilion Aero Laptop 13-be0xxx/8916, BIOS F.17 12/18/2024
> > > > [ 104.033314] Call Trace:
> > > > [ 104.033316] <TASK>
> > > > [ 104.033319] dump_stack_lvl+0x6a/0x90
> > > > [ 104.033328] warn_alloc.cold+0x95/0x1af
> > > > [ 104.033334] ? zone_watermark_ok+0x80/0x80
> > > > [ 104.033350] __alloc_frozen_pages_noprof+0xec3/0x2470
> > > > [ 104.033353] ? __lock_acquire+0x489/0x2600
> > > > [ 104.033359] ? stack_access_ok+0x1c0/0x1c0
> > > > [ 104.033367] ? warn_alloc+0x1d0/0x1d0
> > > > [ 104.033371] ? __lock_acquire+0x489/0x2600
> > > > [ 104.033375] ? _raw_spin_unlock_irqrestore+0x48/0x60
> > > > [ 104.033379] ? _raw_spin_unlock_irqrestore+0x48/0x60
> > > > [ 104.033382] ? lockdep_hardirqs_on+0x78/0x100
> > > > [ 104.033394] allocate_slab+0x2b7/0x510
> > > > [ 104.033399] refill_objects+0x25d/0x380
> > > > [ 104.033407] __pcs_replace_empty_main+0x193/0x5f0
> > > > [ 104.033412] kmem_cache_alloc_noprof+0x5b6/0x6f0
> > > > [ 104.033415] ? alloc_extent_state+0x1b/0x210 [btrfs]
> > > > [ 104.033479] alloc_extent_state+0x1b/0x210 [btrfs]
> > > > [ 104.033527] btrfs_clear_extent_bit_changeset+0x2be/0x9c0 [btrfs]
> > >
> > > Hmm while bisect points out the first bad commit is
> > > commit e47c897a2949 ("slab: add sheaves to most caches"),
> > >
> > > I think the caller is supposed to specify __GFP_NOWARN if it doesn't
> > > care about allocation failure?
> > >
> > > btrfs_clear_extent_bit_changeset() says:
> > > > if (!prealloc) {
> > > > /*
> > > > * Don't care for allocation failure here because we might end
> > > > * up not needing the pre-allocated extent state at all, which
> > > > * is the case if we only have in the tree extent states that
> > > > * cover our input range and don't cover too any other range.
> > > > * If we end up needing a new extent state we allocate it later.
> > > > */
> > > > prealloc = alloc_extent_state(mask);
> > > > }
> > >
> > > Oh wait, I see what's going on. bisection pointed out the commit
> > > because slab tries to refill sheaves with __GFP_NOMEMALLOC (and then
> > > falls back to slowpath if it fails).
> > >
> > > Since failing to refill sheaves doesn't mean the allocation will fail,
> > > it should specify __GFP_NOWARN with __GFP_NOMEMALLOC as long as there's
> > > fallback method.
> > >
> > > But for __prefill_sheaf_pfmemalloc(), it should specify __GFP_NOWARN on
> > > the first attempt only when gfp_pfmemalloc_allowed() returns true.
> >
> > Is this fix sufficient to do the right thing? I tested it, and it does
> > appear to prevent logging of the allocation failures for my test case.
>
> I think we should do both 1) setting __GFP_NOWARN from btrfs side
> and 2) making slab try to refill sheaves with __GFP_NOWARN when
> there's a fallback path.
>
> I'm writing a fix for 2) and I'll send it soon.
>
> > diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
> > index d0dd50f7d279..d2e1083848e8 100644
> > --- a/fs/btrfs/extent-io-tree.c
> > +++ b/fs/btrfs/extent-io-tree.c
> > @@ -641,7 +641,7 @@ int btrfs_clear_extent_bit_changeset(struct extent_io_tree *tree, u64 start, u64
> > * cover our input range and don't cover too any other range.
> > * If we end up needing a new extent state we allocate it later.
> > */
> > - prealloc = alloc_extent_state(mask);
> > + prealloc = alloc_extent_state(mask | __GFP_NOWARN);
>
> This seems to be a right thing to do to me, but as I'm not familiar
> with btrfs, I'll let btrfs folks leave comment on it :)

I agree the flag should be added: as the comment explains, allocation
failures are not fatal at this point. There's another call to
alloc_extent_state() with GFP_ATOMIC, so we cannot simply sink NOWARN
into alloc_extent_state() itself.
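
To illustrate the point about not sinking the flag, here is a minimal
userspace sketch. It is not kernel code: every mock_* name and MOCK_*
value is hypothetical and only models the flag logic. It assumes the
GFP_ATOMIC call site wants to keep its failure warning, which is my
reading rather than something verified against the allocator.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock flag bits: hypothetical values for illustration, not the kernel's. */
typedef unsigned int mock_gfp_t;
#define MOCK_GFP_NOFS   0x01u
#define MOCK_GFP_ATOMIC 0x02u
#define MOCK_GFP_NOWARN 0x10u

/* Stand-in for the allocator's choice to log a failure. */
static bool mock_warns_on_failure(mock_gfp_t mask)
{
	return (mask & MOCK_GFP_NOWARN) == 0;
}

/* Variant A: helper passes the mask through; callers opt in to NOWARN. */
static mock_gfp_t mock_helper_passthrough(mock_gfp_t mask)
{
	return mask;
}

/* Variant B: NOWARN "sunk" into the helper, applied to every caller. */
static mock_gfp_t mock_helper_sunk(mock_gfp_t mask)
{
	return mask | MOCK_GFP_NOWARN;
}

/* Walks both call sites through both variants. */
static void mock_compare_variants(void)
{
	/* Optional preallocation: the caller ORs in NOWARN itself. */
	mock_gfp_t prealloc =
		mock_helper_passthrough(MOCK_GFP_NOFS | MOCK_GFP_NOWARN);
	/* GFP_ATOMIC call site: mask left untouched. */
	mock_gfp_t atomic = mock_helper_passthrough(MOCK_GFP_ATOMIC);

	assert(!mock_warns_on_failure(prealloc));
	assert(mock_warns_on_failure(atomic));

	/* With the flag sunk, the atomic call site goes quiet as well. */
	assert(!mock_warns_on_failure(mock_helper_sunk(MOCK_GFP_ATOMIC)));
}
```

Variant A corresponds to the proposed diff: only the preallocation
caller is silenced, and the other call site keeps whatever mask it
passes.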