Re: [PATCH] shmem: avoid huge pages for small files
From: Dave Chinner
Date: Tue Oct 25 2016 - 01:29:10 EST
On Mon, Oct 24, 2016 at 01:34:53PM -0700, Dave Hansen wrote:
> On 10/21/2016 03:50 PM, Dave Chinner wrote:
> > On Fri, Oct 21, 2016 at 06:00:07PM +0300, Kirill A. Shutemov wrote:
> >> On Fri, Oct 21, 2016 at 04:01:18PM +1100, Dave Chinner wrote:
> >> To me, most of things you're talking about is highly dependent on access
> >> pattern generated by userspace:
> >>
> >> - we may want to allocate huge pages from byte 1 if we know that file
> >> will grow;
> >
> > delayed allocation takes care of that. We use a growing speculative
> > delalloc size that kicks in at specific sizes and can be used
> > directly to determine if a large page should be allocated. This code
> > is aware of sparse files, sparse writes, etc.
>
> OK, so somebody does a write() of 1 byte. We can delay the underlying
> block allocation for a long time, but we can *not* delay the memory
> allocation. We've got to decide before the write() returns.
> How does delayed allocation help with that decision?
You (and Kirill) have likely misunderstood what I'm saying, because
you're assuming I'm talking about delayed allocation of page cache
pages.
I'm not. The current code does this for a sequential write:
write(off, len)
  for each PAGE_SIZE chunk
    grab page cache page
      alloc + insert if not found
    map block to page
      get_block(off, PAGE_SIZE);
        filesystem does allocation      <<<<<<<<<<<<
      update bufferhead attached to page
    write data into page
Essentially, delayed block allocation occurs inside the get_block()
call, completely hidden from the page cache and IO layers.
In XFS, we do special stuff based on the offset being written to, the
size of the existing extent we are adding to, etc. to speculatively
/reserve/ more blocks than the write actually needs and keep them in
a delalloc extent that extends beyond EOF.
The next page is grabbed, get_block is called again, and we find
we've already got a delalloc reservation for that file offset, so
we return immediately. And we repeat that until the delalloc extent
runs out.
When it runs out, we allocate a bigger delalloc extent beyond EOF
so that as the file grows we do fewer, larger delayed allocation
reservations. These grow out to /gigabytes/ if there is that much
data to write.
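
To make that concrete, the shape of the heuristic is roughly this
(a standalone sketch only, not the actual XFS code; the names and the
doubling policy are made up for illustration):

#include <stdint.h>

#define BLOCK_SIZE	4096ULL
#define MAX_PREALLOC	(4ULL << 30)	/* cap the speculation at, say, 4GB */

/*
 * Purely illustrative: decide how many bytes of delayed allocation to
 * reserve beyond EOF for an appending write at @offset.  The real XFS
 * heuristics also look at free space, extent size hints, sparse
 * writes, etc.
 */
static uint64_t
speculative_prealloc_len(uint64_t offset, uint64_t isize,
			 uint64_t last_prealloc)
{
	uint64_t want;

	if (offset + BLOCK_SIZE < isize)
		return 0;	/* not appending, don't speculate */

	/* grow the reservation as the file grows: roughly double each time */
	want = last_prealloc ? last_prealloc * 2 : 16 * BLOCK_SIZE;
	if (want > MAX_PREALLOC)
		want = MAX_PREALLOC;
	return want;
}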
i.e. the filesystem clearly knows when using large pages would be
appropriate, but because that knowledge only surfaces inside the
per-page get_block() call, after the page cache page has already been
allocated, it can't influence that allocation at all.
Here's what the new fs/iomap.c code does for the same sequential
write:
write(off, len)
  iomap_begin(off, len)
    filesystem does delayed allocation of at least len bytes   <<<<<<
    returns an iomap with single mapping
  iomap_apply()
    for each PAGE_SIZE chunk
      grab page cache page
        alloc + insert if not found
      map iomap to page
      write data into page
  iomap_end()
Hence if the write was for 1 byte into an empty file, we'd get a
single block extent back, which matches the single PAGE_SIZE page
cache allocation required. If the app is doing sequential 1 byte IO,
and we're at offset 2MB and the filesystem returns a 2MB delalloc
extent (i.e. it extends 2MB beyond EOF), we know we're getting
sequential write IO and we could use a 2MB page in the page cache
for this.
Similarly, if the app is doing large IO - say 16MB at a time, we'll
get at least a 16MB delalloc extent returned from the filesystem,
and we know we could map that quickly and easily to 8 x 2MB huge
pages in the page cache.
But if we get random 4k writes into a sparse file, the filesystem
will be allocating single blocks, so the iomaps being returned would
be for a single block, and we know that PAGE_SIZE pages would be
best to allocate.
Or we could have a 2MB extent size hint set, so every iomap
returned from the filesystem is going to be 2MB aligned and sized,
in which case we could always map the returned iomap to a huge
page rather than worry about IO sizes and incoming IO patterns.
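
Put another way, the page cache side could turn the mapping the
filesystem hands back into a page size decision with something as
simple as this (again a standalone sketch; the structure and the
policy are illustrative, not the kernel's iomap types):

#include <stdbool.h>
#include <stdint.h>

#define HPAGE_SIZE	(2ULL << 20)	/* 2MB huge page */

/* cut-down stand-in for the mapping the filesystem hands back */
struct extent_map {
	uint64_t offset;	/* file offset the mapping starts at */
	uint64_t length;	/* bytes covered by the mapping */
};

/*
 * Purely illustrative: only use a 2MB page cache page at @pos if the
 * extent the filesystem returned covers the whole aligned 2MB region
 * around it.  Random 4k writes into a sparse file get single block
 * mappings back, so they fall through to PAGE_SIZE pages.
 */
static bool
use_huge_page(const struct extent_map *em, uint64_t pos)
{
	uint64_t huge_start = pos & ~(HPAGE_SIZE - 1);

	if (em->offset > huge_start)
		return false;
	if (em->offset + em->length < huge_start + HPAGE_SIZE)
		return false;
	return true;
}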
Now do you see the difference? The filesystem now has the ability to
optimise allocation based on user application IO patterns rather
than trying to guess from single page size block mapping requests.
And because this happens before we look at the page cache, we can
use that information to influence what we do with the page cache.
> >> I'm not convinced that filesystem is in better position to see access
> >> patterns than mm for page cache. It's not all about on-disk layout.
> >
> > Spoken like a true mm developer. IO performance is all about IO
> > patterns, and the primary contributor to bad IO patterns is bad
> > filesystem allocation patterns.... :P
>
> For writes, I think you have a good point. Managing a horribly
> fragmented file with larger pages and eating the associated write
> magnification that comes along with it seems like a recipe for disaster.
>
> But, Isn't some level of disconnection between the page cache and the
> underlying IO patterns a *good* thing?
Up to a point. Buffered IO only hides underlying IO patterns whilst
readahead and writebehind are effective. They rely on the filesystem
laying out files sequentially for good read rates, and on writes
hitting the disk sequentially for good write rates. The moment either of
these two things break down, the latency induced by the underlying
IO patterns will immediately dominate buffered IO performance
characteristics.
The filesystem allocation and mapping routines will do a far better
job if we can feed them extents rather than 4k pages at a time.
Kirill is struggling to work out how to get huge pages into the page
cache readahead model, and this is something that we'll solve by
adding a readahead implementation to the iomap code. i.e. iomap the
extent, decide on how much to read ahead and the page size to use,
populate the page cache, build the bios, off we go....
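
Roughly, the decision that readahead implementation would make looks
like this (another standalone sketch; the types and the policy are
illustrative only, not the existing readahead code):

#include <stdint.h>

#define PAGE_SIZE	4096ULL
#define HPAGE_SIZE	(2ULL << 20)

/* what the readahead code would act on: how much to read, in what units */
struct ra_plan {
	uint64_t bytes;		/* how far to read ahead */
	uint64_t page_size;	/* PAGE_SIZE or HPAGE_SIZE */
};

/*
 * Purely illustrative: once the extent is mapped we know how much
 * contiguous file data is cheap to read, so size the readahead window
 * and the page cache pages from that instead of guessing one
 * PAGE_SIZE block mapping at a time.
 */
static struct ra_plan
plan_readahead(uint64_t extent_bytes, uint64_t ra_window)
{
	struct ra_plan plan;

	/* don't read ahead past the contiguous extent we just mapped */
	plan.bytes = extent_bytes < ra_window ? extent_bytes : ra_window;

	/* a big contiguous extent can be populated with 2MB pages */
	plan.page_size = plan.bytes >= HPAGE_SIZE ? HPAGE_SIZE : PAGE_SIZE;

	/* round down to a whole number of pages of the chosen size */
	plan.bytes &= ~(plan.page_size - 1);
	return plan;
}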
The problem with the existing code is that the page cache breaks IO
up into discrete PAGE_SIZE mapping requests. For all the fs knows it
could be 1 byte IOs or it could be 1GB IOs being done. We can do
(and already have done) an awful lot of optimisation if we know
exactly what IO is being thrown at the fs.
Effectively we are moving the IO path mapping methods from a
single-page-and-block iteration model to a single-extent-multiple-page
iteration model. It's a different mechanism, but it's a lot more
efficient for moving data around than what we have now...
> Once we've gone to the trouble
> of bringing some (potentially very fragmented) data into the page cache,
> why _not_ manage it in a lower-overhead way if we can? For read-only
> data it seems like a no-brainer that we'd want things in as large of a
> management unit as we can get.
>
> IOW, why let the underlying block allocation layout hamstring how the
> memory is managed?
We're not changing how memory is managed at all. All we are doing
is changing how we populate the page cache....
Cheers,
Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx