Re: fallocate INSERT_RANGE/COLLAPSE_RANGE is completely broken [PATCH]

From: Dave Chinner
Date: Tue Mar 29 2016 - 18:38:32 EST

On Mon, Mar 28, 2016 at 10:04:10PM -0800, Kent Overstreet wrote:
> On Tue, Mar 29, 2016 at 04:15:58PM +1100, Dave Chinner wrote:
> > On Mon, Mar 28, 2016 at 08:25:46PM -0800, Kent Overstreet wrote:
> > > Bit of previous discussion:
> > >
> > >
> > > The underlying issue is that we have no mechanism for invalidating a range of
> > > the pagecache and then _keeping it invalidated_ while we Do Stuff.
> > >
> > > The fallocate INSERT_RANGE/COLLAPSE_RANGE situation seems likely to be worse
> > > than I initially thought. I've been digging into this in the course of bcachefs
> > > testing - I was hitting assertions that meant state hanging off the page cache
> > > (in this case, allocation information, i.e. whether we needed to reserve space
> > > on write) was inconsistent with the btree in writepages().
> > >
> > > Well, bcachefs isn't the only filesystem that hangs additional state off the
> > > pagecache, and the situation today is that an unpriviliged user can cause
> > > inconsistencies there by just doing buffered reads concurrently with
> > > INSERT_RANGE/COLLAPSE_RANGE. I highly highly doubt this is an issue of just
> > > "oops, you corrupted your file because you were doing stupid stuff" - who knows
> > > what internal invariants are getting broken here, and I don't particularly care
> > > to find out.
> >
> > I'd like to see a test case for this. Concurrent IO and/or page
> > faults should not run at the same as fallocate on XFS. Hence I'd
> > like to see the test cases that demonstrate buffered reads are
> > causing corruption during insert/collapse range operations. We use
> > the same locking strategy for fallocate as we use for truncate and
> > all the other internal extent manipulation operations, so if there's
> > something wrong, we need to fix it.
> It's entirely possible I'm wrong about XFS - your fault path locking looked
> correct, and I did see you had extra locking in your buffered read path but I
> thought it was a different lock. I'll recheck later, but for the moment I'm just
> going to assume I misspoke (and tbh always found xfs's locking to be quite
> rigorous).

There are two locks the XFS_IOLOCK for read/write/splice IO path vs
truncate/fallocate exclusion, and XFS_MMAPLOCK for page fault vs
truncate/fallocate exclusion.

> ext4 uses the generic code in all the places you're hooking into though -
> .fault, .read_iter, etc.
> The scheme I've got in this patch should perform quite a bit better than what
> you're doing - only locking in the slow cache miss path, vs. every time you
> touch the page cache.

I'm not sure I follow - how does this work when a fallocate
operation use the page cache for, say, zeroing data blocks rather
than invalidating them (e.g. FALLOC_FL_ZERO_RANGE can validly zero
blocks through the page cache, so can hole punching)? Won't the
buffered read then return a mix of real and zeroed data, depending
who wins the race to each underlying page lock?

i.e. if the locking only occurs in the page insert slow path, then
it doesn't provide sufficient exclusion for extent manipulation
operations that use the page cache during their normal operation.
IOWs, other, higher level synchronisation methods for fallocate
are still necessary....


Dave Chinner