Re: fallocate INSERT_RANGE/COLLAPSE_RANGE is completely broken [PATCH]
From: Kent Overstreet
Date: Tue Mar 29 2016 - 19:46:56 EST
On Wed, Mar 30, 2016 at 09:37:43AM +1100, Dave Chinner wrote:
> There are two locks the XFS_IOLOCK for read/write/splice IO path vs
> truncate/fallocate exclusion, and XFS_MMAPLOCK for page fault vs
> truncate/fallocate exclusion.
Ok, that makes sense then.
> > ext4 uses the generic code in all the places you're hooking into though -
> > .fault, .read_iter, etc.
> >
> > The scheme I've got in this patch should perform quite a bit better than what
> > you're doing - only locking in the slow cache miss path, vs. every time you
> > touch the page cache.
>
> I'm not sure I follow - how does this work when a fallocate
> operation use the page cache for, say, zeroing data blocks rather
> than invalidating them (e.g. FALLOC_FL_ZERO_RANGE can validly zero
> blocks through the page cache, so can hole punching)? Won't the
> buffered read then return a mix of real and zeroed data, depending
> who wins the race to each underlying page lock?
Yes, but does that matter? I don't know of any requirement that the entire
fallocate be atomic, and writes > PAGE_SIZE aren't generally atomic wrt reads
either (but looking at the code it looks like XFS is special there too...)
> i.e. if the locking only occurs in the page insert slow path, then
> it doesn't provide sufficient exclusion for extent manipulation
> operations that use the page cache during their normal operation.
> IOWs, other, higher level synchronisation methods for fallocate
> are still necessary....
Well, I would argue it _is_ enough for internal consistency, if you want to
provide stronger atomicity guarantees to userspace (not that there's anything
wrong with that) I think I'd argue that really that's a different issue.
Also the situation seems to be that XFS provides stronger guarantees than any
other Linux filesystem, which is sucky from an application point of view because
they don't want to have to depend on running on XFS for correctness. It seems
like it'd actually be pretty easy to lift your locking scheme to the generic VFS
code, and then possibly even have it be controlled with a per inode flag so that
applications that don't need the locking aren't paying for it.