RFC: hold i_rwsem until aio completes

From: Christoph Hellwig
Date: Tue Jan 14 2020 - 11:12:30 EST


Hi all,

Asynchronous read/write operations currently use a rather magic locking
scheme, where access to file data is normally protected using a rw_semaphore,
but if we are doing aio where the syscall returns to userspace before the
I/O has completed we also use an atomic_t to track the outstanding aio
ops. This scheme has led to lots of subtle bugs in file systems that
didn't wait for the count to reach zero, and due to its ad hoc nature also
means we have to serialize direct I/O writes that are smaller than the
file system block size.

All this is solved by releasing i_rwsem only when the I/O has actually
completed, but doing so goes against two mantras of Linux locking primitives:

(1) no unlocking by another process than the one that acquired it
(2) no return to userspace with locks held
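
As a rough userspace illustration of what holding the lock until completion
means, and how it runs into both rules, consider the sketch below. It uses a
POSIX binary semaphore because, unlike a mutex or rwlock, it has no notion
of an owner. struct toy_file and the toy_* functions are made up for this
example and do not correspond to anything in the series.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>

/* Hypothetical file object; the binary semaphore models an exclusive lock. */
struct toy_file {
	sem_t lock;
};

/* Completion context: a different thread than the one that took the lock. */
static void *toy_completion(void *arg)
{
	struct toy_file *f = arg;

	/* Non-owner unlock: this is what rule (1) forbids. */
	sem_post(&f->lock);
	return NULL;
}

/*
 * Submission: take the lock, start the "I/O" and return with the lock
 * still held, which is what rule (2) forbids.  Returns 0 on success.
 */
static int toy_submit(struct toy_file *f, pthread_t *completer)
{
	int err;

	sem_wait(&f->lock);
	err = pthread_create(completer, NULL, toy_completion, f);
	if (err)
		sem_post(&f->lock); /* no completion will run, unlock here */
	return err;
}

/* Helper for callers: true if nobody holds the lock right now. */
static bool toy_lock_is_free(struct toy_file *f)
{
	if (sem_trywait(&f->lock) != 0)
		return false;
	sem_post(&f->lock);
	return true;
}
```

The semaphore gets away with this only because it tracks no owner, which is
also why it gives up everything owner tracking is used for.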

As it happens, we already have various places that work around this. A few
callers do non-owner unlocks of rwsems, which are pretty nasty for
PREEMPT_RT as the owner tracking doesn't work. OTOH the file system
freeze code has both problems and works around them a little better,
although in a somewhat awkward way, in that it releases the lockdep
object when returning to userspace, and reacquires it when done, and
also clears the rwsem owner when returning to userspace, and then sets
the new owner before unlocking.
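
The owner handling part of that workaround can be sketched in userspace as
follows. This is only a toy model under my own assumptions: struct
toy_owned_lock, the toy_* helpers and the integer task ids (which must be
nonzero) are invented for illustration, and it models neither lockdep nor
the real rwsem owner field.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_NO_OWNER 0	/* task ids used with this lock must be nonzero */

/* Hypothetical lock with owner tracking; not the kernel rwsem. */
struct toy_owned_lock {
	atomic_bool held;
	atomic_int  owner;	/* task id of the owner, or TOY_NO_OWNER */
};

static void toy_lock_init(struct toy_owned_lock *l)
{
	atomic_init(&l->held, false);
	atomic_init(&l->owner, TOY_NO_OWNER);
}

/* Take the lock and record the caller as owner. */
static bool toy_trylock(struct toy_owned_lock *l, int task)
{
	bool expected = false;

	if (!atomic_compare_exchange_strong(&l->held, &expected, true))
		return false;
	atomic_store(&l->owner, task);
	return true;
}

/* Before "returning to userspace": keep the lock, clear the owner. */
static bool toy_disown(struct toy_owned_lock *l, int task)
{
	int expected = task;

	return atomic_compare_exchange_strong(&l->owner, &expected,
					      TOY_NO_OWNER);
}

/* In the completion path: become the owner of an ownerless held lock. */
static bool toy_adopt(struct toy_owned_lock *l, int task)
{
	int expected = TOY_NO_OWNER;

	return atomic_load(&l->held) &&
	       atomic_compare_exchange_strong(&l->owner, &expected, task);
}

/* Unlock is only allowed for the recorded owner. */
static bool toy_unlock(struct toy_owned_lock *l, int task)
{
	int expected = task;

	if (!atomic_compare_exchange_strong(&l->owner, &expected,
					    TOY_NO_OWNER))
		return false;
	atomic_store(&l->held, false);
	return true;
}
```

The sequence is toy_trylock() and toy_disown() in the submitter, then
toy_adopt() and toy_unlock() in the completer, so that the unlock itself
is always done by the recorded owner.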

This series tries to follow that scheme, although it doesn't fully work. The
first issue is that the rwsem code has a bug where it doesn't properly
handle clearing the owner. This series has a patch to fix that, but it
is ugly and might not be correct, so some help is needed. Second, I/O
completions often come from interrupt context, which means the re-acquire
is recorded as from irq context, leading to warnings about incorrect
contexts. I wonder if we could just have a bit in lockdep that says
returning to userspace is ok for this particular lock? That would also
clean up the fsfreeze situation a lot.

Let me know what you think of all this. While I converted all the
iomap-using file systems, only XFS is actually tested.

Diffstat:

24 files changed, 144 insertions(+), 180 deletions(-)