Re: [PATCH 0/4] Fix filesystem freezing

From: Dave Chinner
Date: Thu Jan 12 2012 - 19:09:43 EST


On Thu, Jan 12, 2012 at 12:30:31PM +0100, Jan Kara wrote:
> On Thu 12-01-12 13:48:41, Dave Chinner wrote:
> > On Thu, Jan 12, 2012 at 02:20:49AM +0100, Jan Kara wrote:
> > >
> > > Hello,
> > >
> > > filesystem freezing is currently racy and thus we can end up with dirty data
> > > on frozen filesystem (see changelog of the first patch for detailed race
> > > description and proposed fix). This patch series aims at fixing this.
> >
> > It only fixes the dirty data race (i.e. SB_FREEZE_WRITE). The same
> > race conditions exist for SB_FREEZE_TRANS on XFS, and so need the
> > same fix. That race has had one previous attempt at fixing it in
> > XFS but that's not possible:
> >
> > b2ce397 Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc"
> > 7a249cf xfs: fix filesystsem freeze race in xfs_trans_alloc
> >
> > It was looking at that problem earlier today that led to the
> > solution Eric proposed. Essentially the method in these patches
> > needs to replace the xfs specific m_active_trans counter and delay
> > during ->fs_freeze to prevent that race condition....
> OK, I see. I just checked ext4 to make sure and ext4 seems to get this
> right. Looking into Christoph's original patch it shouldn't be hard to fix
> it. Instead of:
> 	atomic_inc(&mp->m_active_trans);
>
> 	if (wait_for_freeze)
> 		xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
>
> we just need to do a bit more elaborate
>
> retry:
> 	if (wait_for_freeze)
> 		xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> 	atomic_inc(&mp->m_active_trans);
> 	if (wait_for_freeze && mp->m_super->s_frozen >= SB_FREEZE_TRANS) {
> 		atomic_dec(&mp->m_active_trans);
> 		goto retry;
> 	}
>
> Or does XFS support nested transactions (i.e. a thread already holding a
> running transaction can call into xfs_trans_alloc() again)?
> That would make things more complicated...

You're still missing the point - this isn't an XFS specific
problem, nor is the write problem an ext4 specific problem. The
problem is that these are freeze state transition problems -
something that can affect every filesystem because the freeze code
is generic. Quite frankly, I'm not interested in having a generic
solution for SB_FREEZE_WRITE and a custom, per filesystem solution
for SB_FREEZE_TRANS when the solution is exactly the same.

> Using sb_start_write() instead of m_active_trans won't be that easy because
> it can create A-A deadlocks (e.g. we do sb_start_write in
> block_page_mkwrite() and then xfs_get_blocks() decides to start a
> transaction and calls sb_start_write() again which might block if
> filesystem freezing started in the mean time).

So, like Eric said in his first email, it's not a "write start/end"
interface that is needed, the interface has to work with different
freeze levels (e.g. "sb_freeze_ref(sb, level)/sb_freeze_drop(sb,
level)"). Sure, internally it would have to map to two counters and
different level checks, but it solves the same problem for all
levels of freeze for all filesystems.
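
As a rough userspace sketch of the idea (not kernel code): one counter
per freeze level, with a reference taken by incrementing first and then
re-checking the freeze state. The struct, the enum and the non-blocking
sb_freeze_tryref()/sb_freeze_try_set() forms are made up for
illustration; a real implementation would sleep and retry rather than
return false.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical levels, ordered like SB_FREEZE_WRITE < SB_FREEZE_TRANS. */
enum freeze_level { FREEZE_NONE = 0, FREEZE_WRITE = 1, FREEZE_TRANS = 2 };

/* Hypothetical stand-in for the freeze-related part of a superblock. */
struct sb_model {
	atomic_int frozen;	/* current freeze level */
	atomic_int refs[2];	/* refs[0]: writers, refs[1]: transactions */
};

/*
 * Take a reference at @level. Increment first, then check the freeze
 * state, so the freezer either sees our count or we see its level.
 * Returns false (with the count restored) if that level is frozen.
 */
static bool sb_freeze_tryref(struct sb_model *sb, enum freeze_level level)
{
	assert(level == FREEZE_WRITE || level == FREEZE_TRANS);
	atomic_int *cnt = &sb->refs[level - 1];

	atomic_fetch_add(cnt, 1);
	if (atomic_load(&sb->frozen) >= (int)level) {
		atomic_fetch_sub(cnt, 1);
		return false;
	}
	return true;
}

/* Drop a reference previously taken at @level. */
static void sb_freeze_drop(struct sb_model *sb, enum freeze_level level)
{
	assert(level == FREEZE_WRITE || level == FREEZE_TRANS);
	atomic_fetch_sub(&sb->refs[level - 1], 1);
}

/*
 * Freezer side: publish the new level, then report whether all
 * references at that level have drained. A real freezer would wait
 * here until the count reaches zero.
 */
static bool sb_freeze_try_set(struct sb_model *sb, enum freeze_level level)
{
	assert(level == FREEZE_WRITE || level == FREEZE_TRANS);
	atomic_store(&sb->frozen, level);
	return atomic_load(&sb->refs[level - 1]) == 0;
}
```

In this model a task holding a FREEZE_WRITE reference can still take a
FREEZE_TRANS reference while the freeze is only at the write level, so
the nested case above does not block against itself.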

Let's fix this freeze problem once and for all in the generic code,
and not have to keep coming back to it to add more functionality for
different situations the most recent fix didn't handle for random
filesystem X....

> So it's up to XFS maintainers to decide what's best but I'd take
> Christoph's patch with above fixup. I guess I'll put it in this series and
> see what people say.

Eric and I have already discussed and agreed to replacing the XFS
specific code with the fixed VFS level API where other XFS
developers including the "XFS Maintainers" (*) can see. Nobody has
objected so I doubt there's any problem with doing so.

Besides, anything that replaces custom XFS code with a better
generic solution is pretty much guaranteed to be done. And given
that this is not an XFS specific problem and it needs to be fixed at
the VFS level.....

Cheers,

Dave.

(*) keep in mind that "XFS Maintainer" is just a figurehead who
maintains the tree that is sent to Linus, not the person with final
say over what changes are made. That decision is made by the
reviewers of the code...

--
Dave Chinner
david@xxxxxxxxxxxxx