Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

From: Kirill A. Shutemov
Date: Wed Nov 02 2016 - 04:32:16 EST


On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> > > If I understand the motivation right, it is mostly about being able to mmap
> > > PMD-sized chunks to userspace. So my naive idea would be that we could just
> > > implement it by allocating PMD sized chunks of pages when adding pages to
> > > page cache, we don't even have to read them all unless we come from PMD
> > > fault path.
> >
> > Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> > per-hugepage, one common list of buffer heads...
> >
> > PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where
> > handling it otherwise doesn't make sense) and handling it differently for
> > file-THP is a nightmare from a maintenance POV.
>
> But the complexity of two different page sizes for page cache and *each*
> filesystem that wants to support it does not make the maintenance easy
> either.

I think with time we can make small pages just a subcase of huge pages.
And some generalization can be made once more than one filesystem with
backing storage adopts huge pages.

> So I'm not convinced that using the same rules for anon-THP and
> file-THP is a clear win.

We already have file-THP with the same rules: tmpfs. Backing storage is
what changes the picture.

> And if we have these two options neither of which has negligible
> maintenance cost, I'd also like to see more justification for why it is
> a good idea to have file-THP for normal filesystems. Do you have any
> performance numbers that show it is a win under some realistic workload?

See below. As usual with huge pages, they make sense when you have plenty
of memory.

> I'd also note that having PMD-sized pages has some obvious disadvantages as
> well:
>
> 1) I'm not sure buffer head handling code will quite scale to 512 or even
> 2048 buffer_heads on a linked list referenced from a page. It may work but
> I suspect the performance will suck.

Yes, the buffer_head list doesn't scale. That's the main reason (along
with 4) why syscall-based IO sucks: we spend a lot of time looking for the
desired block.

We need to switch to some other data structure for storing buffer_heads.
Is there a reason why we have a list there in the first place?
Why not just an array?

I will look into it, but this sounds like a separate infrastructure change
project.
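
To make the list-vs-array point concrete, here is a minimal userspace
sketch, not kernel code. It assumes a 2 MiB PMD-sized page (as on x86-64);
struct blk_desc and all function names are hypothetical stand-ins rather
than the kernel's struct buffer_head or its helpers, and the list is
NULL-terminated for simplicity. It shows where the 512 and 2048 figures
come from and how many list steps a lookup of the last block costs compared
with a direct array index.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 2 MiB PMD-sized huge page, as on x86-64. */
#define HPAGE_SIZE (2UL * 1024 * 1024)

/* Hypothetical per-block descriptor; NOT the kernel's struct buffer_head. */
struct blk_desc {
	struct blk_desc *next;   /* singly linked list linkage */
	unsigned long blocknr;   /* block index within the huge page */
};

/* Number of descriptors needed to cover one huge page. */
static size_t descs_per_hpage(size_t blocksize)
{
	assert(blocksize != 0 && HPAGE_SIZE % blocksize == 0);
	return HPAGE_SIZE / blocksize;
}

/*
 * List lookup: walk from the head to the descriptor covering 'offset'.
 * Cost is linear in the block index; '*steps' reports the links followed.
 */
static struct blk_desc *lookup_list(struct blk_desc *head, size_t blocksize,
				    size_t offset, size_t *steps)
{
	size_t idx = offset / blocksize;
	struct blk_desc *d = head;

	*steps = 0;
	while (d && idx--) {
		d = d->next;
		(*steps)++;
	}
	return d;
}

/* Array lookup: constant time, one division and one index. */
static struct blk_desc *lookup_array(struct blk_desc *descs, size_t blocksize,
				     size_t offset)
{
	assert(offset < HPAGE_SIZE);
	return &descs[offset / blocksize];
}
```

With 1k blocks, reaching the last block of the page follows 2047 links in
the list variant, while the array variant is a single index computation.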

> 2) PMD-sized pages result in increased space & memory usage.

Space? Do you mean disk space? Not really: we still don't write beyond
i_size or into holes.

Behaviour wrt holes may change with mmap()-IO as we have less
granularity, but the same can be seen just between different
architectures: 4k vs. 64k base page size.

> 3) In ext4 we have to estimate how much metadata we may need to modify when
> allocating blocks underlying a page in the worst case (you don't seem to
> update this estimate in your patch set). With 2048 blocks underlying a page,
> each possibly in a different block group, it is a lot of metadata forcing
> us to reserve a large transaction (not sure if you'll be able to even
> reserve such large transaction with the default journal size), which again
> makes things slower.

I didn't see this on profiles. And xfstests looks fine. I probably need to
run them with 1k blocks once again.

> 4) As you have noted some places like write_begin() still depend on 4k
> pages which creates a strange mix of places that use subpages and that use
> head pages.

Yes, this needs to be addressed to restore syscall-IO performance and take
advantage of huge pages.

But again, it's an infrastructure change that would likely affect the
interface between VFS and filesystems. It deserves a separate patchset.

> All this would be a non-issue (well, except 2 I guess) if we just didn't
> expose filesystems to the fact that something like file-THP exists.

The numbers below were generated with fio. The working set is relatively
small, so it fits into page cache and the write set doesn't hit dirty_ratio.

I think the mmap performance should be enough to justify initial inclusion
of an experimental feature: it is useful for workloads that target
mmap()-IO. It will take time for the feature to mature anyway.

Configuration:
- 2x E5-2697v2, 64G RAM;
- INTEL SSDSC2CW24;
- IO request size is 4k;
- 8 processes, 512MB data set each;

Workload
  read/write    baseline    stddev  huge=always    stddev    change
-------------------------------------------------------------------
sync-read
  read          21439.00    348.14     20297.33    259.62    -5.33%
sync-write
  write          6833.20    147.08      3630.13     52.86   -46.88%
sync-readwrite
  read           4377.17     17.53      2366.33     19.52   -45.94%
  write          4378.50     17.83      2365.80     19.94   -45.97%
sync-randread
  read           5491.20     66.66     14664.00    288.29   167.05%
sync-randwrite
  write          6396.13     98.79      2035.80      8.17   -68.17%
sync-randrw
  read           2927.30    115.81      1036.08     34.67   -64.61%
  write          2926.47    116.45      1036.11     34.90   -64.60%
libaio-read
  read            254.36     12.49       258.63     11.29     1.68%
libaio-write
  write          4979.20    122.75      2904.77     17.93   -41.66%
libaio-readwrite
  read           2738.57    142.72      2045.80      4.12   -25.30%
  write          2729.93    141.80      2039.77      3.79   -25.28%
libaio-randread
  read            113.63      2.98       210.63      5.07    85.37%
libaio-randwrite
  write          4456.10     76.21      1649.63      7.00   -62.98%
libaio-randrw
  read             97.85      8.03       877.49     28.27   796.80%
  write            97.55      7.99       874.83     28.19   796.77%
mmap-read
  read          20654.67    304.48     24696.33   1064.07    19.57%
mmap-write
  write          8652.33    272.44     13187.33    499.10    52.41%
mmap-readwrite
  read           6620.57     16.05      9221.60    399.56    39.29%
  write          6623.63     16.34      9222.13    399.31    39.23%
mmap-randread
  read           6717.23   1360.55     21939.33    326.38   226.61%
mmap-randwrite
  write          3204.63    253.66     12371.00     61.49   286.03%
mmap-randrw
  read           2150.50     78.00      7682.67    188.59   257.25%
  write          2149.50     78.00      7685.40    188.35   257.54%
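
For anyone re-checking the table: the change column is consistent with the
relative difference of the huge=always mean against the baseline mean. A
minimal sketch of that arithmetic follows; the function name change_pct is
hypothetical and only illustrates the formula, it is not part of fio or of
the scripts used to produce the numbers.

```c
#include <assert.h>

/*
 * Relative change of the huge=always result against the baseline,
 * in percent. Negative values mean huge=always is slower.
 * Hypothetical helper, for illustration only.
 */
static double change_pct(double baseline, double huge)
{
	assert(baseline != 0.0);
	return (huge - baseline) / baseline * 100.0;
}
```

For example, change_pct(21439.00, 20297.33) gives roughly -5.33, which
matches the sync-read row.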

--
Kirill A. Shutemov