Re: [PATCH] shmem: avoid huge pages for small files

From: Kirill A. Shutemov
Date: Fri Oct 21 2016 - 19:33:50 EST


On Sat, Oct 22, 2016 at 09:50:13AM +1100, Dave Chinner wrote:
> On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> > > On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote:
> > > > > Ugh, no, please don't use mount options for file-specific behaviours
> > > > > in filesystems like ext4 and XFS. This is exactly the sort of
> > > > > behaviour that should either just work automatically (i.e. be
> > > > > completely controlled by the filesystem) or only be applied to files
> > > >
> > > > Can you explain what you mean? How would the file system control it?
> > >
> > > There's no point in asking for huge pages when populating the page
> > > cache if the file is:
> > >
> > > - significantly smaller than the huge page size
> > > - largely sparse
> > > - being randomly accessed in small chunks
> > > - badly fragmented and so takes hundreds of IOs to read/write
> > > a huge page
> > > - able to optimise delayed allocation to match huge page
> > > sizes and alignments
> > >
> > > These are all constraints the filesystem knows about, but the
> > > application and user don't.
> >
> > Really?
> >
> > To me, most of the things you're talking about are highly dependent on
> > the access pattern generated by userspace:
> >
> > - we may want to allocate huge pages from byte 1 if we know that the
> > file will grow;
>
> Delayed allocation takes care of that. We use a growing speculative
> delalloc size that kicks in at specific sizes and can be used
> directly to determine if a large page should be allocated. This code
> is aware of sparse files, sparse writes, etc.

I'm confused here. How can we delay allocation of the page cache?

Delalloc helps to get a reasonable on-disk layout, but my
understanding is that it uses the page cache as a buffer to postpone
block allocation. It's only later, at writeback time, that we see the
access pattern, via the pages already sitting in the page cache.
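
To spell out where I'm lost, here's a toy model (userspace sketch with
made-up helpers, not the real kernel code) of buffered write plus
delalloc as I understand it: the page cache pages, small or huge, are
allocated at write() time, before delalloc or writeback ever run:

#include <stdio.h>

static void alloc_page_cache_page(long index)
{
	printf("write():   allocate+dirty page %ld (no disk blocks yet)\n", index);
}

static void alloc_blocks(long first, long last)
{
	printf("writeback: allocate blocks for pages %ld..%ld in one go\n", first, last);
}

int main(void)
{
	/* application writes three pages; page cache is populated now */
	for (long i = 0; i < 3; i++)
		alloc_page_cache_page(i);

	/* only later does writeback see the whole range and pick a layout */
	alloc_blocks(0, 2);
	return 0;
}

So the huge-vs-small decision has to be made in the first step, before
delalloc has anything to say about it.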

I'm likely missing something important here. Hm?

> > - it will be beneficial to allocate a huge page even for fragmented
> > files, if they're read-mostly;
>
> No, no it won't. The IO latency impact here can be massive.
> Read-ahead of single 4k pages hides most of this latency from the
> application, but with a 2MB page we can't use readahead to hide this
> IO latency, because the first access could stall for hundreds of
> small random read IOs to be completed instead of just one.

I agree that it will lead to an initial latency spike. But don't we
have workloads that would tolerate it in exchange for faster hot-cache
behaviour?
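
(To put rough numbers on the spike: a 2MB page covers 512 4k blocks, so
a badly fragmented range can require up to 512 random reads before the
first access completes. At ~10ms per random read on rotating storage
that's on the order of 5 seconds for one fault, versus ~10ms for a
single 4k page.)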

> > > Further, we are moving the IO path to a model where we use extents
> > > for mapping, not blocks. We're optimising for the fact that modern
> > > filesystems use extents and so massively reduce the number of block
> > > mapping lookup calls we need to do for a given IO.
> > >
> > > i.e. instead of doing "get page, map block to page" over and over
> > > again until we've walked over the entire IO range, we're doing
> > > "map extent for entire IO range" once, then iterating "get page"
> > > until we've mapped the entire range.
> >
> > That's great, but it's not how the IO path works *now*. And it will
> > take a long time (if ever) to flip it over to what you've described.
>
> Wrong. fs/iomap.c. XFS already uses it, ext4 is being converted
> right now, GFS2 will use parts of it in the next release, DAX
> already uses it and PMD support in DAX is being built on top of it.

That's interesting. I've managed to miss the whole fs/iomap.c thing...
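
If I understand the model correctly, the block mapping lookup gets
hoisted out of the per-page loop. Roughly this toy contrast (userspace
sketch with made-up helpers, not the real iomap_begin/iomap_end
interfaces):

#include <stdio.h>

#define PAGES 8

/* old model: one block mapping lookup per page */
static void per_block_model(void)
{
	for (int i = 0; i < PAGES; i++)
		printf("get page %d, map block for page %d\n", i, i);
}

/* iomap model: map the extent once, then just iterate pages */
static void extent_model(void)
{
	printf("map extent covering pages 0..%d\n", PAGES - 1);
	for (int i = 0; i < PAGES; i++)
		printf("get page %d\n", i);
}

int main(void)
{
	per_block_model();
	extent_model();
	return 0;
}

One mapping lookup per IO instead of one per page would indeed change
the economics of the mapping calls.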

> > > As such, there is no way we should be considering different
> > > interfaces and methods for configuring the /same functionality/ just
> > > because DAX is enabled or not. It's the /same decision/ that needs
> > > to be made, and the filesystem knows an awful lot more about whether
> > > huge pages can be used efficiently at the time of access than just
> > > about any other actor you can name....
> >
> > I'm not convinced that the filesystem is in a better position than mm
> > to see page cache access patterns. It's not all about on-disk layout.
>
> Spoken like a true mm developer.

Guilty.

--
Kirill A. Shutemov