Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

From: Jan Kara
Date: Thu Nov 03 2016 - 13:56:17 EST


On Wed 02-11-16 07:36:12, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> > I'd also note that having PMD-sized pages has some obvious disadvantages as
> > well:
> >
> > 1) I'm not sure the buffer head handling code will scale to 512 or even
> > 2048 buffer_heads on a linked list referenced from a single page. It may
> > work but I suspect the performance will suck.
>
> buffer_head handling always sucks. For the iomap based buffered write
> path I plan to support a buffer_head-less mode for the block size ==
> PAGE_SIZE case in 4.11 at the latest, or even for 4.10 if I get enough
> other things off my plate in time. I think that's the right way to go
> for THP, especially if we require the fs to allocate the whole huge
> page as a single extent, similar to the DAX PMD mapping case.

Yeah, if we require the whole THP to be backed by a single extent, things
get simpler. But there's still the issue that ext4 cannot easily use the
iomap code for buffered writes because of the data exposure issue we
already talked about - well, ext4 itself could actually work (it supports
unwritten extents), but the old compatibility formats won't, and I'd
strongly prefer not to have two independent write paths in ext4... But
I'll put more thought into this; I have an idea how we could hack around
the problem even for on-disk formats that don't support unwritten extents.
The trick would be to mark the range of the file as unwritten purely in
the in-memory extent cache we have; that should protect us against
exposing uninitialized pages to racing faults.
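
To make the idea a bit more concrete, here is a rough sketch of the
write-side flow I have in mind. es_mark_range_unwritten(),
es_mark_range_written() and allocate_and_copy_data() are hypothetical
placeholders for whatever we'd actually do with the extent status tree
and the write path, not existing functions:

/*
 * Sketch only: flag the range as unwritten purely in the in-memory
 * extent cache before the blocks become reachable through the page
 * cache. A racing fault that looks up the range then sees "unwritten"
 * and serves zeroes instead of stale on-disk data, even though the
 * on-disk format has no unwritten extent support.
 */
static int buffered_write_blocks(struct inode *inode, ext4_lblk_t lblk,
                                 unsigned int len)
{
        int err;

        /* hypothetical helper: mark the range unwritten in memory only */
        err = es_mark_range_unwritten(inode, lblk, len);
        if (err)
                return err;

        /* allocate blocks and copy the new data into the page cache */
        err = allocate_and_copy_data(inode, lblk, len);
        if (err)
                return err;

        /*
         * Only once the pages carry the new data do we flip the
         * in-memory state back to written.
         */
        return es_mark_range_written(inode, lblk, len);
}

Nothing in the on-disk format would need to change; the cost would
presumably be that those extent cache entries must not be reclaimed until
the data is actually in place.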

> > 2) PMD-sized pages result in increased space & memory usage.
>
> How so?

Well, memory usage is clear I guess - if the files are smaller than the
THP size, or if you don't use all the 4k pages forming a THP, you are
wasting memory. Sure, it can be somewhat controlled by the heuristics
deciding when to use THPs in the pagecache and when to fall back to 4k
pages.
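
(Just to put a rough number on it, assuming the usual 2MB PMD size on
x86-64: caching a 16KB file with a huge page pins 2MB of page cache,
i.e. 2048 / 16 = 128 times the file size, and similarly for any THP of
which only a handful of 4k subpages are ever touched.)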

Regarding space usage - it is mostly a problem for sparse mmapped IO,
where you always have to allocate (and write out) all the blocks
underlying a THP that gets written to, even though you may only need 4K
from that area...
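
E.g. with 4KB blocks under a 2MB THP, a single 4KB store into an
otherwise sparse region means allocating and eventually writing back all
512 blocks backing that huge page, i.e. roughly 512x the space and write
amplification compared to what a 4k page would need.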

> > 3) In ext4 we have to estimate how much metadata we may need to modify
> > in the worst case when allocating blocks underlying a page (you don't
> > seem to update this estimate in your patch set). With 2048 blocks
> > underlying a page, each possibly in a different block group, that is a
> > lot of metadata, forcing us to reserve a large transaction (not sure
> > you'll even be able to reserve such a large transaction with the default
> > journal size), which again makes things slower.
>
> As said above I think we should only use huge page mappings if there is
> a single underlying extent, same as in DAX to keep the complexity down.
>
> > 4) As you have noted, some places like write_begin() still depend on 4k
> > pages, which creates a strange mix of places that use subpages and
> > places that use head pages.
>
> Just use the iomap buffered I/O code and all these issues will go away.

Yep, the above two things would make this somewhat less ugly, I agree.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR