Re: [PATCH 5/8] xfs: Protect xfs_file_aio_write() & xfs_setattr_size() with sb_start_write - sb_end_write

From: Jan Kara
Date: Tue Jan 24 2012 - 14:35:35 EST


On Tue 24-01-12 18:19:26, Dave Chinner wrote:
> On Fri, Jan 20, 2012 at 09:34:43PM +0100, Jan Kara wrote:
> > Replace racy xfs_wait_for_freeze() check in xfs_file_aio_write() with
> > a reliable sb_start_write() - sb_end_write() locking. Due to lock ranking
> > dictated by the page fault code we have to call sb_start_write() after we
> > acquire ilock.
>
> It appears to me that you have indeed confused the ilock with the
> iolock.
>
> > Similarly we have to protect xfs_setattr_size() because it can modify last
> > page of truncated file. Because ilock is dropped in xfs_setattr_size() we
> > have to drop and retake write access as well to avoid deadlocks.
>
> >
> > CC: Ben Myers <bpm@xxxxxxx>
> > CC: Alex Elder <elder@xxxxxxxxxx>
> > Signed-off-by: Jan Kara <jack@xxxxxxx>
> > ---
> > fs/xfs/xfs_file.c | 6 ++++--
> > fs/xfs/xfs_iops.c | 6 ++++++
> > 2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 753ed9b..9efd153 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -862,9 +862,11 @@ xfs_file_dio_aio_write(
> > *iolock = XFS_IOLOCK_SHARED;
> > }
> >
> > + sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> > trace_xfs_file_direct_write(ip, count, iocb->ki_pos, 0);
> > ret = generic_file_direct_write(iocb, iovp,
> > &nr_segs, pos, &iocb->ki_pos, count, ocount);
> > + sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
>
> That's inside the iolock, not the ilock. Either way, it is
> incorrect. This accounting should be outside the iolock - because
> xfs_trans_alloc() can be called with the iolock held. Therefore the
> freeze/lock order needs to be
>
> sb_start_write(SB_FREEZE_WRITE)
> XFS(ip)->i_iolock
> XFS(ip)->i_ilock
> sb_end_write(SB_FREEZE_WRITE)
>
> Which matches the current freeze/lock order.
Hmm, so I was looking at this and I think there are the following locking
constraints (please correct me if I have something wrong):
iolock -> trans start (per your claim above)
trans start -> ilock (ditto)
iolock -> mmap_sem (aio write holds iolock, and copying data from userspace
might need mmap_sem if it hits a page fault)
mmap_sem -> ilock (do_wp_page -> block_page_mkwrite -> __xfs_get_blocks)
freezing -> trans start (so that we can clean the filesystem during
freezing)

So I see two choices here.
1) Put 'freezing' above iolock as you suggest. But then handling the page
fault path becomes challenging. We cannot block there easily because we are
called with mmap_sem held. I just talked with Mel and it seems that
dropping mmap_sem in ->page_mkwrite(), blocking, retaking mmap_sem and
returning VM_FAULT_RETRY might work but we'll see whether some other mm guy
won't kill me for that ;).
2) Put 'freezing' below mmap_sem. That would put it below iolock/i_mutex
as well. Then handling the page fault is easy. We would no longer block in
->aio_write; we'd have to block in ->write_begin() instead. Similarly, we
would have to block in the other write paths.
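
The constraints and the two choices above can be checked mechanically by
treating each "A -> B" as an edge in a graph and looking for a cycle. The
following is only a userspace sketch in C: the enum names, struct lock_graph
and every helper function are made up for illustration and are not kernel
code. The extra edge "mmap_sem -> freezing" models a ->page_mkwrite() that
simply blocks on a frozen filesystem while mmap_sem is held, which is an
assumption about the naive approach and not something the patch does.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for illustration; these are not kernel identifiers. */
enum lock_class { FREEZING, IOLOCK, MMAP_SEM, TRANS_START, ILOCK, NR_CLASSES };

struct lock_graph {
	/* edge[a][b] != 0 means "a is acquired before b" */
	unsigned char edge[NR_CLASSES][NR_CLASSES];
};

void add_order(struct lock_graph *g, enum lock_class first,
	       enum lock_class second)
{
	assert(first < NR_CLASSES && second < NR_CLASSES);
	g->edge[first][second] = 1;
}

/* The five ordering constraints listed in the mail. */
void init_base_constraints(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
	add_order(g, IOLOCK, TRANS_START);
	add_order(g, TRANS_START, ILOCK);
	add_order(g, IOLOCK, MMAP_SEM);
	add_order(g, MMAP_SEM, ILOCK);
	add_order(g, FREEZING, TRANS_START);
}

/* Depth-first search; state: 0 = unvisited, 1 = on the stack, 2 = done. */
int visit(const struct lock_graph *g, int node, int *state)
{
	state[node] = 1;
	for (int next = 0; next < NR_CLASSES; next++) {
		if (!g->edge[node][next])
			continue;
		if (state[next] == 1)
			return 1;	/* back edge: ordering cycle */
		if (state[next] == 0 && visit(g, next, state))
			return 1;
	}
	state[node] = 2;
	return 0;
}

/* Returns 1 if the orderings contain a cycle (a potential deadlock). */
int has_cycle(const struct lock_graph *g)
{
	int state[NR_CLASSES] = { 0 };

	for (int n = 0; n < NR_CLASSES; n++)
		if (state[n] == 0 && visit(g, n, state))
			return 1;
	return 0;
}

/* Choice 1, assuming the page fault path blocks on freeze under mmap_sem. */
int choice1_naive_deadlocks(void)
{
	struct lock_graph g;

	init_base_constraints(&g);
	add_order(&g, FREEZING, IOLOCK);
	add_order(&g, MMAP_SEM, FREEZING);	/* assumed naive ->page_mkwrite() */
	return has_cycle(&g);
}

/* Choice 2: freezing sits below both mmap_sem and iolock. */
int choice2_deadlocks(void)
{
	struct lock_graph g;

	init_base_constraints(&g);
	add_order(&g, MMAP_SEM, FREEZING);
	add_order(&g, IOLOCK, FREEZING);
	return has_cycle(&g);
}
```

Under these assumptions choice1_naive_deadlocks() finds the cycle
freezing -> iolock -> mmap_sem -> freezing, which is why choice 1 needs the
drop-mmap_sem-and-retry trick, while choice2_deadlocks() finds no cycle.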

The first approach has the advantage that we could put lots of the freeze
checks into the VFS, sharing them among filesystems (possibly even
making freezing reliable for filesystems such as ext2). The second approach
is simpler, as we could do most of the freeze checks when we start a
transaction, at least for filesystems that have transactions... Any
preferences?
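
Whichever placement wins, the bracket itself works the same way: a write
takes "write access" before it starts and drops it when it is done, and a
freeze completes only once no write is in flight. A minimal single-threaded
model of that idea, as a sketch only: struct model_sb and the model_*
functions are hypothetical names, the start function returns false where the
real sb_start_write() would sleep, and nothing here reflects how the kernel
actually implements it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct super_block. */
struct model_sb {
	int  writers;	/* writes currently between start and end */
	bool freezing;	/* a freeze has been requested */
};

/*
 * Non-blocking stand-in for sb_start_write(): returning false means the
 * caller would have to wait until the filesystem is thawed.
 */
bool model_start_write(struct model_sb *sb)
{
	if (sb->freezing)
		return false;
	sb->writers++;
	return true;
}

/* Stand-in for sb_end_write(): the write no longer holds off a freeze. */
void model_end_write(struct model_sb *sb)
{
	assert(sb->writers > 0);
	sb->writers--;
}

/* Request a freeze: new writers are refused from now on. */
void model_freeze_begin(struct model_sb *sb)
{
	sb->freezing = true;
}

/* The freeze is complete only once every in-flight write has ended. */
bool model_frozen(const struct model_sb *sb)
{
	return sb->freezing && sb->writers == 0;
}

void model_thaw(struct model_sb *sb)
{
	sb->freezing = false;
}
```

In this model a write that called model_start_write() before
model_freeze_begin() keeps model_frozen() false until its model_end_write(),
which is the property the start/end pair is meant to give; where the pair
sits relative to iolock and mmap_sem is the open question.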

Honza

> > @@ -945,8 +949,6 @@ xfs_file_aio_write(
> > if (ocount == 0)
> > return 0;
> >
> > - xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
> > -
>
> that's where sb_start_write() needs to be, and the sb_end_write()
> call needs to be below the generic_write_sync() calls that will trigger
> IO on O_SYNC writes. Otherwise it is not covering all the IO path
> correctly.
>
> > if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> > return -EIO;
> >
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 3579bc8..798b9c6 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -793,6 +793,7 @@ xfs_setattr_size(
> > return xfs_setattr_nonsize(ip, iattr, 0);
> > }
> >
> > + sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> > /*
> > * Make sure that the dquots are attached to the inode.
> > */
> > @@ -849,10 +850,14 @@ xfs_setattr_size(
> > xfs_get_blocks);
> > if (error)
> > goto out_unlock;
> > + /* Drop the write access to avoid lock inversion with ilock */
> > + sb_end_write(inode->i_sb, SB_FREEZE_WRITE);
> >
> > xfs_ilock(ip, XFS_ILOCK_EXCL);
> > lock_flags |= XFS_ILOCK_EXCL;
> >
> > + sb_start_write(inode->i_sb, SB_FREEZE_WRITE);
> > +
>
> This is caused by the previous problems I pointed out. You should
> not need to drop the freeze reference here at all.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR