Re: [PATCHv3 15/41] filemap: handle huge pages in do_generic_file_read()

From: Kirill A. Shutemov
Date: Mon Nov 07 2016 - 06:32:19 EST

On Thu, Nov 03, 2016 at 09:40:12PM +0100, Jan Kara wrote:
> On Wed 02-11-16 11:32:04, Kirill A. Shutemov wrote:
> > Yes, buffer_head list doesn't scale. That's the main reason (along with 4)
> > why syscall-based IO sucks. We spend a lot of time looking for desired
> > block.
> >
> > We need to switch to some other data structure for storing buffer_heads.
> > Is there a reason why we have list there in first place?
> > Why not just array?
> >
> > I will look into it, but this sounds like a separate infrastructure change
> > project.
> As Christoph said iomap code should help you with that and make things
> simpler. If things go as we imagine, we should be able to pretty much avoid
> buffer heads. But it will take some time to get there.

Just to clarify: is it show-stopper or we can live with buffer_head list
for now?

> > > 2) PMD-sized pages result in increased space & memory usage.
> >
> > Space? Do you mean disk space? Not really: we still don't write beyond
> > i_size or into holes.
> >
> > Behaviour wrt to holes may change with mmap()-IO as we have less
> > granularity, but the same can be seen just between different
> > architectures: 4k vs. 64k base page size.
> Yes, I meant different granularity of mmap based IO. And I agree it isn't a
> new problem but the scale of the problem is much larger with 2MB pages than
> with say 64K pages. And actually the overhead of higher IO granularity of
> 64K pages has been one of the reasons we have switched SLES PPC kernels
> from 64K pages to 4K pages (we've got complaints from customers).

I guess fadvise()/madvise() hints for opt-in/opt-out should be good enough
to deal with this. I probably need to wire them up.

> > > 3) In ext4 we have to estimate how much metadata we may need to modify when
> > > allocating blocks underlying a page in the worst case (you don't seem to
> > > update this estimate in your patch set). With 2048 blocks underlying a page,
> > > each possibly in a different block group, it is a lot of metadata forcing
> > > us to reserve a large transaction (not sure if you'll be able to even
> > > reserve such large transaction with the default journal size), which again
> > > makes things slower.
> >
> > I didn't saw this on profiles. And xfstests looks fine. I probably need to
> > run them with 1k blocks once again.
> You wouldn't see this in profiles - it is a correctness thing. And it won't
> be triggered unless the file is heavily fragmented which likely does not
> happen with any test in xfstests. If it happens you'll notice though - the
> filesystem will just report error and shut itself down.

Any suggestion how I can simulate this situation?

> > The numbers below generated with fio. The working set is relatively small,
> > so it fits into page cache and writing set doesn't hit dirty_ratio.
> >
> > I think the mmap performance should be enough to justify initial inclusion
> > of an experimental feature: it useful for workloads that targets mmap()-IO.
> > It will take time to get feature mature anyway.
> I agree it will take time for feature to mature so I'me fine with
> suboptimal performance in some cases. But I'm not fine with some of the
> hacks you do currently because code maintenability is an issue even if
> people don't actually use the feature...

Hm. Okay, I'll try to check what I can do to make it more maintainable.
My worry is that it will make the patchset even bigger...

> > Configuration:
> > - 2x E5-2697v2, 64G RAM;
> > - IO request size is 4k;
> > - 8 processes, 512MB data set each;
> The numbers indeed look interesting for mmaped case. Can you post the fio
> cmdline? I'd like to compare profiles...

fio \
--directory=/mnt/ \
--name="$engine-$rw" \
--ioengine="$engine" \
--rw="$rw" \
--size=512M \
--invalidate=1 \
--numjobs=8 \
--runtime=60 \
--time_based \

Kirill A. Shutemov