Re: [PATCHv6 11/37] HACK: readahead: alloc huge pages, if allowed

From: Andreas Dilger
Date: Thu Feb 09 2017 - 19:24:16 EST

On Feb 9, 2017, at 4:34 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> On Thu, Jan 26, 2017 at 02:57:53PM +0300, Kirill A. Shutemov wrote:
>> Most page cache allocation happens via readahead (sync or async), so if
>> we want to have a significant number of huge pages in page cache we need
>> to find a way to allocate them from readahead.
>> Unfortunately, huge pages don't fit into the current readahead design:
>> 128 max readahead window, assumption on page size, PageReadahead() to
>> track hit/miss.
>> I haven't found a way to get it right yet.
>> This patch just allocates a huge page if allowed, but doesn't really
>> provide any readahead if a huge page is allocated. We read out 2M at a
>> time and I would expect spikes in latency without readahead.
>> Therefore HACK.
>> That said, I don't think it should prevent huge page support from
>> being applied. The future will show whether lacking readahead is a big
>> deal with huge pages in page cache.
>> Any suggestions are welcome.
> Well ... what if we made readahead 2 hugepages in size for inodes which
> are using huge pages? That's only 8x our current readahead window, and
> if you're asking for hugepages, you're accepting that IOs are going to
> be larger, and you probably have the kind of storage system which can
> handle doing larger IOs.

It would be nice if the bdi had a parameter for the maximum readahead size.
Currently, readahead is capped at 2MB chunks by force_page_cache_readahead()
even if bdi->ra_pages and bdi->io_pages are much larger.
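
To illustrate the effect of that cap, here is a rough userspace model (not
the kernel source). The function name, the page_size parameter, and the
max_pages argument standing in for the larger of ra_pages and io_pages are
all assumptions made for the sketch. It counts how many separate 2MB-capped
submissions one forced readahead request turns into:

```c
#include <assert.h>

#define MODEL_CHUNK_BYTES (2UL * 1024 * 1024)   /* the fixed 2MB chunk */

/*
 * Hypothetical model, not a kernel function: clamp the request to
 * max_pages, then walk it in chunks of at most 2MB and return the
 * number of chunks that would be submitted.
 */
static unsigned long model_forced_readahead_chunks(unsigned long nr_to_read,
                                                   unsigned long max_pages,
                                                   unsigned long page_size)
{
    unsigned long chunk_pages;
    unsigned long chunks = 0;

    /* The model only makes sense for page sizes between 1 byte and 2MB. */
    assert(page_size > 0 && page_size <= MODEL_CHUNK_BYTES);
    chunk_pages = MODEL_CHUNK_BYTES / page_size;

    if (nr_to_read > max_pages)
        nr_to_read = max_pages;

    while (nr_to_read) {
        unsigned long this_chunk = chunk_pages;

        if (this_chunk > nr_to_read)
            this_chunk = nr_to_read;
        nr_to_read -= this_chunk;
        chunks++;
    }
    return chunks;
}
```

With 4KB pages, a 16MB request (4096 pages) against a 16MB limit is still
issued as eight 2MB pieces in this model, no matter how large the limit is.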

It should be up to the filesystem to decide how large the readahead chunks
are rather than imposing some policy in the MM code. For high-speed (network)
storage access it is better to have at least 4MB read chunks; for RAID storage
it is desirable to have stripe-aligned readahead to avoid read inflation when
verifying the parity. Any fixed size will eventually be inadequate as disks
and filesystems change, so it may as well be a per-bdi tunable that can be set
by the filesystem as needed, or possibly with a mount option.
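
A minimal sketch of what stripe-aligned readahead under such a tunable could
look like, written as plain userspace C. Nothing here is an existing kernel
interface: struct model_ra_window, model_stripe_align(), stripe_pages, and
max_ra_pages (standing in for the proposed per-bdi limit) are hypothetical
names chosen for illustration.

```c
#include <assert.h>

/* Hypothetical readahead window: first page index and length in pages. */
struct model_ra_window {
    unsigned long start;
    unsigned long nr_pages;
};

/*
 * Expand the request [index, index + nr_pages) outward to stripe
 * boundaries, then cap it at max_ra_pages rounded down to whole stripes.
 * The cap is assumed to be at least one stripe, so the stripe that
 * contains index is always covered.
 */
static struct model_ra_window model_stripe_align(unsigned long index,
                                                 unsigned long nr_pages,
                                                 unsigned long stripe_pages,
                                                 unsigned long max_ra_pages)
{
    struct model_ra_window win;
    unsigned long start, end;

    assert(stripe_pages > 0);
    assert(max_ra_pages >= stripe_pages);

    /* Round the start down and the end up to a stripe boundary. */
    start = index - (index % stripe_pages);
    end = index + nr_pages;
    end = ((end + stripe_pages - 1) / stripe_pages) * stripe_pages;

    /* Keep only whole stripes if the window exceeds the limit. */
    if (end - start > max_ra_pages)
        end = start + (max_ra_pages / stripe_pages) * stripe_pages;

    win.start = start;
    win.nr_pages = end - start;
    return win;
}
```

For example, with a 1MB stripe (256 pages of 4KB) and a 4MB limit (1024
pages), a 100-page request at page 300 becomes the single full stripe
starting at page 256, so the parity for that stripe is verified from one
read rather than from partial ones.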

Cheers, Andreas
