Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS

From: Matthew Wilcox

Date: Thu Oct 30 2025 - 17:25:52 EST


On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
> On 2025-10-25 12:45, Matthew Wilcox wrote:
> > No, absolutely not. We're not having open-coded GFP_NOFAIL semantics.
> > The right way forward is for ext4 to use iomap, not for buffer heads
> > to support large block sizes.
>
> ext4 only calls getblk_unmovable or __getblk when reading critical
> metadata. Both of these functions set __GFP_NOFAIL to ensure that
> metadata reads do not fail due to memory pressure.
>
> Both functions eventually call grow_dev_folio(), which is why we
> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
> has similar logic, but XFS manages its own metadata, allowing it
> to use vmalloc for memory allocation.

In today's ext4 call, we discussed various options:

1. Change folios to be potentially fragmented. This change would be
ridiculously large and nobody thinks this is a good idea. Included here
for completeness.

2. Separate the buffer cache from the page cache again. They were
unified about 25 years ago, and this also feels like a very big job.

3. Duplicate the buffer cache into ext4/jbd2, remove the functionality
not needed and make _this_ version of the buffer cache allocate
its own memory instead of aliasing into the page cache. More feasible
than 1 or 2; still quite a big job.

4. Pick up Catherine's work and make ext4/jbd2 use it. Seems to be
about an equivalent amount of work to option 3.

5. Make __GFP_NOFAIL work for allocations up to 64KiB (we decided this was
probably the practical limit of sector sizes that people actually want).
In terms of programming, it's a one-line change. But we need to sell
this change to the MM people. I think it's doable because if we have
a filesystem with 64KiB sectors, there will be many clean folios in the
pagecache which are 64KiB or larger, and reclaiming one of those frees
enough contiguous memory to satisfy the allocation.
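
For reference, the arithmetic behind option 5 can be sketched in plain
userspace C. This is an illustration only, not kernel code: the
sketch_* names are hypothetical, the 4KiB base page size is an
assumption, and SKETCH_COSTLY_ORDER mirrors the kernel's
PAGE_ALLOC_COSTLY_ORDER (3) as I understand it. With those assumptions a
64KiB block is an order-4 allocation, one step above the costly
threshold.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for illustration: 4KiB base pages, costly order of 3. */
#define SKETCH_PAGE_SIZE    4096u
#define SKETCH_COSTLY_ORDER 3u

/*
 * Smallest order such that (page_size << order) >= size.
 * Similar in spirit to the kernel's get_order(), but written
 * here as a hypothetical standalone helper.
 */
static unsigned int sketch_order(size_t size, size_t page_size)
{
	unsigned int order = 0;

	while ((page_size << order) < size)
		order++;
	return order;
}

/* Non-zero if a block of this size needs an order above the costly one. */
static int sketch_is_costly(size_t size)
{
	return sketch_order(size, SKETCH_PAGE_SIZE) > SKETCH_COSTLY_ORDER;
}

/* Self-check of the numbers quoted in the text above. */
static void sketch_selfcheck(void)
{
	assert(sketch_order(64 * 1024, SKETCH_PAGE_SIZE) == 4);
	assert(sketch_is_costly(64 * 1024));
	assert(!sketch_is_costly(32 * 1024));
}
```

On a system with 64KiB base pages the same block is order 0, so the
question only arises when the block size exceeds the page size.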

So, we liked option 5 best.