Re: [PATCH] shmem: avoid huge pages for small files

From: Kirill A. Shutemov
Date: Fri Oct 21 2016 - 11:00:18 EST

On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote:
> > > Ugh, no, please don't use mount options for file specific behaviours
> > > in filesystems like ext4 and XFS. This is exactly the sort of
> > > behaviour that should either just work automatically (i.e. be
> > > completely controlled by the filesystem) or only be applied to files
> >
> > Can you explain what you mean? How would the file system control it?
> There's no point in asking for huge pages when populating the page
> cache if the file is:
> - significantly smaller than the huge page size
> - largely sparse
> - being randomly accessed in small chunks
> - badly fragmented and so takes hundreds of IO to read/write
> a huge page
> - able to optimise delayed allocation to match huge page
> sizes and alignments
> These are all constraints the filesystem knows about, but the
> application and user don't.
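(To make the constraints above concrete: a filesystem-side gate over them
might look like the following C sketch. Every name, field, and threshold
here is made up for illustration; none of this is actual kernel API.)

```c
#include <stdbool.h>
#include <stddef.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)   /* 2MB huge page on x86-64 */

/* Illustrative per-file information a filesystem could consult. */
struct file_hints {
    size_t size;              /* i_size */
    size_t allocated;         /* bytes actually backed by extents */
    unsigned long nr_extents; /* extents covering the range */
    bool random_small_io;     /* recent access pattern was small/random */
};

/*
 * Hypothetical gate: decline huge pages when any of the conditions
 * quoted above holds. The thresholds are invented for the example.
 */
static bool hugepage_worthwhile(const struct file_hints *h)
{
    if (h->size < HPAGE_SIZE / 2)      /* much smaller than a huge page */
        return false;
    if (h->allocated * 2 < h->size)    /* largely sparse */
        return false;
    if (h->random_small_io)            /* random access in small chunks */
        return false;
    if (h->nr_extents > 16)            /* badly fragmented */
        return false;
    return true;
}
```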

To me, most of the things you're talking about are highly dependent on the
access pattern generated by userspace:

- we may want to allocate huge pages from byte 1 if we know the file
  will grow;
- the same for a sparse file that will be filled in;
- it can be beneficial to allocate a huge page even for a fragmented
  file, if it's read-mostly;

> None of these aspects can be optimised sanely by a single threshold,
> especially when considering the combination of access patterns vs file
> layout.

I agree.

Here I tried to address the particular performance regression I see with
huge pages enabled on tmpfs. It's not meant to fix all possible issues.

> Further, we are moving the IO path to a model where we use extents
> for mapping, not blocks. We're optimising for the fact that modern
> filesystems use extents and so massively reduce the number of block
> mapping lookup calls we need to do for a given IO.
> i.e. instead of doing "get page, map block to page" over and over
> again until we've walked over the entire IO range, we're doing
> "map extent for entire IO range" once, then iterating "get page"
> until we've mapped the entire range.
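(The difference Dave describes can be sketched as two loops over the same
IO range; the names and the lookup counter are illustrative only, not the
real iomap interface.)

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Illustrative extent: a contiguous run covering [off, off+len). */
struct extent {
    size_t off;
    size_t len;
};

/* Hypothetical "map the whole range at once" call: one lookup. */
static struct extent map_extent(size_t off, size_t len,
                                unsigned long *lookups)
{
    struct extent e = { off, len };
    (*lookups)++;
    return e;
}

/* Old model: one block-mapping lookup per 4k page. */
static unsigned long io_per_block(size_t off, size_t len)
{
    unsigned long lookups = 0;
    for (size_t pos = off; pos < off + len; pos += PAGE_SIZE)
        lookups++;               /* "get page, map block to page" */
    return lookups;
}

/* Extent model: one lookup, then iterate "get page" with no mapping. */
static unsigned long io_per_extent(size_t off, size_t len)
{
    unsigned long lookups = 0;
    struct extent e = map_extent(off, len, &lookups);
    for (size_t pos = e.off; pos < e.off + e.len; pos += PAGE_SIZE)
        ;                        /* "get page" only */
    return lookups;
}
```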

That's great, but it's not how the IO path works *now*. And it will take a
long time (if ever) to flip it over to what you've described.

> Hence if we have a 2MB IO come in from userspace, and the iomap
> returned covers that entire range, it's a no-brainer to ask the
> page cache for a huge page instead of iterating 512 times to map all
> the 4k pages needed.

Yeah, it's a no-brainer.

But do we want to limit huge page allocation only to such best-possible
cases? I've hardly ever seen 2MB IOs in the real world...

And this approach puts too much decision power on the first access to a
file range, which may or may not represent the future access pattern.
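(For illustration, that "best-possible case" reduces to an alignment and
coverage check on the incoming IO; this sketch is hypothetical, and it also
shows how the decision gets baked in at first access.)

```c
#include <stdbool.h>
#include <stddef.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

/*
 * Hypothetical decision for one incoming IO: ask the page cache for a
 * huge page only when the IO is a 2MB-aligned request of at least 2MB
 * and the mapped extent covers the whole range. Anything smaller or
 * misaligned falls back to 4k pages, whatever later accesses look like.
 */
static bool want_huge_page(size_t io_off, size_t io_len,
                           size_t extent_off, size_t extent_len)
{
    if (io_off % HPAGE_SIZE || io_len < HPAGE_SIZE)
        return false;            /* not a 2MB-aligned, 2MB+ IO */
    if (extent_off > io_off ||
        extent_off + extent_len < io_off + io_len)
        return false;            /* extent doesn't cover the IO */
    return true;
}
```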

> > > specifically configured with persistent hints to reliably allocate
> > > extents in a way that can be easily mapped to huge pages.
> >
> > > e.g. on XFS you will need to apply extent size hints to get large
> > > page sized/aligned extent allocation to occur, and so this
> >
> > It sounds like you're confusing alignment in memory with alignment
> > on disk here? I don't see why on disk alignment would be needed
> > at all, unless we're talking about DAX here (which is out of
> > scope currently) Kirill's changes are all about making the memory
> > access for cached data more efficient, it's not about disk layout
> > optimizations.
> No, I'm not confusing this with DAX. However, this automatic use
> model for huge pages fits straight into DAX as well. Same
> mechanisms, same behaviours, slightly stricter alignment
> characteristics. All stuff the filesystem already knows about.
> Mount options are, quite frankly, a terrible mechanism for
> specifying filesystem policy. Setting up DAX this way was a mistake,
> and it's a mount option I plan to remove from XFS once we get nearer
> to having DAX feature complete and stabilised. We've already got
> on-disk "use DAX for this file" flags in XFS, so we can easily and
> cleanly support different methods of accessing PMEM from the same
> filesystem.
> As such, there is no way we should be considering different
> interfaces and methods for configuring the /same functionality/ just
> because DAX is enabled or not. It's the /same decision/ that needs
> to be made, and the filesystem knows an awful lot more about whether
> huge pages can be used efficiently at the time of access than just
> about any other actor you can name....

I'm not convinced that the filesystem is in a better position than the mm
to see access patterns for the page cache. It's not all about on-disk
layout.

Kirill A. Shutemov