Re: [PATCH] shmem: avoid huge pages for small files

From: Dave Chinner
Date: Fri Oct 21 2016 - 18:50:22 EST

On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> > On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote:
> > > > Ugh, no, please don't use mount options for file specific behaviours
> > > > in filesystems like ext4 and XFS. This is exactly the sort of
> > > > behaviour that should either just work automatically (i.e. be
> > > > completely controlled by the filesystem) or only be applied to files
> > >
> > > Can you explain what you mean? How would the file system control it?
> >
> > There's no point in asking for huge pages when populating the page
> > cache if the file is:
> >
> > - significantly smaller than the huge page size
> > - largely sparse
> > - being randomly accessed in small chunks
> > - badly fragmented and so takes hundreds of IO to read/write
> > a huge page
> > - able to optimise delayed allocation to match huge page
> > sizes and alignments
> >
> > These are all constraints the filesystem knows about, but the
> > application and user don't.
> Really?
> To me, most of the things you're talking about are highly dependent on
> the access pattern generated by userspace:
> - we may want to allocate huge pages from byte 1 if we know that file
> will grow;

Delayed allocation takes care of that. We use a growing speculative
delalloc size that kicks in at specific sizes and can be used
directly to determine if a large page should be allocated. This code
is aware of sparse files, sparse writes, etc.
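
A minimal sketch of the idea, for illustration only. This is not the
XFS delalloc code: the function names, the 64KB starting size, the 1GB
cap, the doubling rule and the sparse-write test are all assumptions
made up for the example. It only shows how a reservation size that
grows with the file could be compared against the huge page size.

```python
HUGE_PAGE_SIZE = 2 * 1024 * 1024      # 2MB, the huge page size used in this thread
MIN_PREALLOC = 64 * 1024              # hypothetical starting reservation
MAX_PREALLOC = 1024 * 1024 * 1024     # hypothetical upper bound


def speculative_delalloc_size(file_size):
    """Hypothetical rule: double the reservation until it covers the file size."""
    size = MIN_PREALLOC
    while size < file_size and size < MAX_PREALLOC:
        size *= 2
    return size


def wants_huge_page(old_size, write_offset, write_len):
    """Return True once the reservation has grown to at least one huge page."""
    # Assumption: a write landing a huge page or more beyond EOF is sparse,
    # so no speculation is done for it.
    if write_offset - old_size >= HUGE_PAGE_SIZE:
        return False
    new_size = max(old_size, write_offset + write_len)
    return speculative_delalloc_size(new_size) >= HUGE_PAGE_SIZE
```

With these made-up thresholds, an append to a 4KB file does not ask
for a huge page, an append to a 1.5MB file does, and a 4KB write at a
1GB offset into a 4KB file is treated as sparse and does not.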

> - the same for sparse file that will be filled;

See above.

> - it will be beneficial to allocate huge page even for fragmented files,
> if it's read-mostly;

No, no it won't. The IO latency impact here can be massive.
Readahead of single 4k pages hides most of this latency from the
application, but with a 2MB page, we can't use readahead to hide this
IO latency because the first access could stall waiting for hundreds
of small random read IOs to complete instead of just one.
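
To put a number on "hundreds", here is a small sketch that counts how
many separate reads are needed to fill a range of a file. It assumes
each extent is physically discontiguous from its neighbours and so
needs its own IO; the (file_offset, length) tuple layout and the
function name are hypothetical, not a kernel interface.

```python
HUGE_PAGE_SIZE = 2 * 1024 * 1024   # 2MB huge page
BLOCK_SIZE = 4096                  # 4k filesystem block / small page


def ios_for_range(extents, start, length):
    """Count the extents overlapping the byte range [start, start + length).

    extents is a list of (file_offset, length) tuples, each assumed to be
    physically discontiguous and therefore to cost one read IO.
    """
    end = start + length
    return sum(1 for off, ln in extents if off < end and off + ln > start)


# Worst case: every 4k block of the first 2MB is its own extent.
fragmented = [(i * BLOCK_SIZE, BLOCK_SIZE) for i in range(512)]
# Best case: the same 2MB is one contiguous extent.
contiguous = [(0, HUGE_PAGE_SIZE)]
```

For the fragmented layout, filling one 2MB page takes 512 reads before
the first access can complete, while filling one 4k page takes a
single read. For the contiguous layout, both take one read.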

> > Further, we are moving the IO path to a model where we use extents
> > for mapping, not blocks. We're optimising for the fact that modern
> > filesystems use extents and so massively reduce the number of block
> > mapping lookup calls we need to do for a given IO.
> >
> > i.e. instead of doing "get page, map block to page" over and over
> > again until we've walked over the entire IO range, we're doing
> > "map extent for entire IO range" once, then iterating "get page"
> > until we've mapped the entire range.
> That's great, but it's not how the IO path works *now*. And it will
> take a long time (if ever) to flip it over to what you've described.

Wrong. fs/iomap.c. XFS already uses it, ext4 is being converted
right now, GFS2 will use parts of it in the next release, DAX
already uses it and PMD support in DAX is being built on top of it.
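
The difference between the two models quoted above can be sketched by
counting mapping lookups. This is a toy model, not the fs/iomap.c
interface: lookup_extent() and both counting functions are
hypothetical names, extents are plain (file_offset, length) tuples,
and holes are not handled.

```python
PAGE_SIZE = 4096


def lookup_extent(extents, pos):
    """Return the (file_offset, length) extent containing byte pos."""
    for off, ln in extents:
        if off <= pos < off + ln:
            return off, ln
    raise ValueError("no extent maps offset %d" % pos)


def lookups_per_page(extents, start, length):
    """Old model: one mapping lookup for every page in the IO range."""
    lookups = 0
    for pos in range(start, start + length, PAGE_SIZE):
        lookup_extent(extents, pos)
        lookups += 1
    return lookups


def lookups_per_extent(extents, start, length):
    """Extent model: one lookup per extent, then walk its pages unmapped."""
    lookups = 0
    pos = start
    end = start + length
    while pos < end:
        off, ln = lookup_extent(extents, pos)
        lookups += 1
        # Every page up to the end of this extent (or of the IO) is now
        # mapped; the "get page" loop over it needs no further lookups.
        pos = min(end, off + ln)
    return lookups
```

For a 1MB IO over a single 1MB extent, the per-page model does 256
lookups and the per-extent model does one.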

> > As such, there is no way we should be considering different
> > interfaces and methods for configuring the /same functionality/ just
> > because DAX is enabled or not. It's the /same decision/ that needs
> > to be made, and the filesystem knows an awful lot more about whether
> > huge pages can be used efficiently at the time of access than just
> > about any other actor you can name....
> I'm not convinced that filesystem is in better position to see access
> patterns than mm for page cache. It's not all about on-disk layout.

Spoken like a true mm developer. IO performance is all about IO
patterns, and the primary contributor to bad IO patterns is bad
filesystem allocation patterns.... :P

We're rapidly moving away from the world where a page cache is
needed to give applications decent performance. DAX doesn't have a
page cache, and applications wanting to use high IOPS (hundreds of
thousands to millions) storage are using direct IO, because the page
cache just introduces latency, memory usage issues and
non-deterministic IO behaviour.

If we try to make the page cache the "one true IO optimisation source"
then we're screwing ourselves because the incoming IO technologies
simply don't require it anymore. We need to be ahead of that curve,
not playing catchup, and that's why these sorts of "what should the
page cache do" decisions really need to come from the IO path where
we see /all/ the IO, not just buffered IO....


Dave Chinner