Re: [PATCH v3 05/11] readahead: allocate folios with mapping_min_order in readahead

From: Matthew Wilcox
Date: Mon Mar 25 2024 - 15:00:42 EST


On Wed, Mar 13, 2024 at 06:02:47PM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@xxxxxxxxxxx>
>
> page_cache_ra_unbounded() was allocating single pages (0 order folios)
> if there was no folio found in an index. Allocate mapping_min_order folios
> as we need to guarantee the minimum order if it is set.
> When read_pages() is triggered and if a page is already present, check
> for truncation and move the ractl->_index by mapping_min_nrpages if that
> folio was truncated. This is done to ensure we keep the alignment
> requirement while adding a folio to the page cache.
>
> page_cache_ra_order() tries to allocate folio to the page cache with a
> higher order if the index aligns with that order. Modify it so that the
> order does not go below the min_order requirement of the page cache.

This paragraph doesn't make sense. We have an assertion that there's no
folio in the page cache with a lower order than the minimum, so this
seems to be describing a situation that can't happen. Does it need to
be rephrased (because you're actually describing something else) or is
it just stale?

> @@ -239,23 +258,35 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> 			 * not worth getting one just for that.
> 			 */
> 			read_pages(ractl);
> -			ractl->_index += folio_nr_pages(folio);
> +
> +			/*
> +			 * Move the ractl->_index by at least min_pages
> +			 * if the folio got truncated to respect the
> +			 * alignment constraint in the page cache.
> +			 *
> +			 */
> +			if (mapping != folio->mapping)
> +				nr_pages = min_nrpages;
> +
> +			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
> +			ractl->_index += nr_pages;
> 			i = ractl->_index + ractl->_nr_pages - index;
> 			continue;
> 		}
>
> -		folio = filemap_alloc_folio(gfp_mask, 0);
> +		folio = filemap_alloc_folio(gfp_mask,
> +					mapping_min_folio_order(mapping));
> 		if (!folio)
> 			break;
> 		if (filemap_add_folio(mapping, folio, index + i,
> 					gfp_mask) < 0) {
> 			folio_put(folio);
> 			read_pages(ractl);
> -			ractl->_index++;
> +			ractl->_index += min_nrpages;

Hah, you changed this here. Please move into previous patch.

> 			i = ractl->_index + ractl->_nr_pages - index;
> 			continue;
> 		}
> -		if (i == nr_to_read - lookahead_size)
> +		if (i == mark)
> 			folio_set_readahead(folio);
> 		ractl->_workingset |= folio_test_workingset(folio);
> 		ractl->_nr_pages += folio_nr_pages(folio);
> @@ -489,12 +520,18 @@ void page_cache_ra_order(struct readahead_control *ractl,
> {
> 	struct address_space *mapping = ractl->mapping;
> 	pgoff_t index = readahead_index(ractl);
> +	unsigned int min_order = mapping_min_folio_order(mapping);
> 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
> 	pgoff_t mark = index + ra->size - ra->async_size;
> 	int err = 0;
> 	gfp_t gfp = readahead_gfp_mask(mapping);
> +	unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
>
> -	if (!mapping_large_folio_support(mapping) || ra->size < 4)
> +	/*
> +	 * Fallback when size < min_nrpages as each folio should be
> +	 * at least min_nrpages anyway.
> +	 */
> +	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
> 		goto fallback;
>
> 	limit = min(limit, index + ra->size - 1);
> @@ -505,9 +542,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
> 			new_order = MAX_PAGECACHE_ORDER;
> 		while ((1 << new_order) > ra->size)
> 			new_order--;
> +		if (new_order < min_order)
> +			new_order = min_order;

I think these are the two lines you're describing in the paragraph that
doesn't make sense to me? And if so, I think they're useless?

> @@ -515,7 +562,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> 		if (index & ((1UL << order) - 1))
> 			order = __ffs(index);
> 		/* Don't allocate pages past EOF */
> -		while (index + (1UL << order) - 1 > limit)
> +		while (order > min_order && index + (1UL << order) - 1 > limit)
> 			order--;

This raises an interesting question that I don't know if we have a test
for. POSIX says that if we mmap, let's say, the first 16kB of a 10kB
file, then we can store into offset 0-12287, but stores to offsets
12288-16383 get a signal (SIGBUS, per the POSIX mmap() spec). Thus far,
we've declined to even create folios in the page cache that would let us
create PTEs for offset 12288-16383, so I haven't paid too much attention
to this. Now we're going to have folios that extend into that range, so
we need to be sure that when we mmap(), we only create PTEs that go as
far as 12287.

Can you check that we have such an fstest, and that we still pass it
with your patches applied and a suitably large block size?