Re: + fs-prepare_write-fixes.patch added to -mm tree

From: Nick Piggin
Date: Thu Oct 19 2006 - 01:25:57 EST


On Wed, Oct 18, 2006 at 06:42:09PM -0700, Mark Fasheh wrote:
> Hi Nick,
>
> On Wed, Oct 18, 2006 at 02:50:21PM -0700, akpm@xxxxxxxx wrote:
> > Some prepare/commit_write implementations have possible pre-existing bugs,
> > possible data leaks (by setting uptodate too early) and data corruption (by
> > not reading in non-modified parts of a !uptodate page).
> >
> > Others are (also) broken when commit_write passes in a 0 length commit with
> > a !uptodate page (a change caused by buffered write deadlock fix patch).
> >
> > Fix filesystems as best we can. GFS2, OCFS2, Reiserfs, JFFS are nontrivial
> > and are likely broken. All others at least need a glance.
> I would have liked a CC on this patch, considering that ypu might have just
> broken ocfs2 :) I wouldn't have even seen the patch had I not been looking
> through my mm-commits mailbox for an unrelated patch.

I sent an RFC to linux-fsdevel, did you get that?

I was planning to cc some maintainers, including you, for those
filesystems that are non-trivial. I just hadn't had a chance to
test it properly last night.

> > commit_write: If prepare_write succeeds, new data will be copied
> > - into the page and then commit_write will be called. It will
> > - typically update the size of the file (if appropriate) and
> > - mark the inode as dirty, and do any other related housekeeping
> > - operations. It should avoid returning an error if possible -
> > - errors should have been handled by prepare_write.
> > + into the page and then commit_write will be called. commit_write may
> > + be called with a range that is smaller than that passed in to
> > + prepare_write, it could even be zero. If the page is not uptodate,
> > + the range will *only* be zero or the full length that was passed to
> > + prepare_write, if it is zero, the page should not be marked uptodate
> > + (success should still be returned, if possible -- the write will be
> > + retried).
> Doesn't this scheme have the potential to leave dirty data in holes? If a
> file system does it's allocation for a hole in the middle of a file (so no
> i_size update) in ->prepare_write(), but then in ->commit_write() it's not
> allowed to write out the page, or add it to a transaction (in the case of
> ext3/ocfs2 ordered writes), we might commit the allocation tree changes
> without writing data to the actual disk region that the allocation covers. A
> subsequent read would then return whatever junk was on disk.

That might be the case, I suppose.

If you don't want to track this yourself internally, we could think
about adding a new callback for the non-committed region if that
would help? (->abort_write).

>
>
> > diff -puN fs/ext3/inode.c~fs-prepare_write-fixes fs/ext3/inode.c
> > --- a/fs/ext3/inode.c~fs-prepare_write-fixes
> > +++ a/fs/ext3/inode.c
> > @@ -1214,21 +1214,24 @@ static int ext3_ordered_commit_write(str
> > struct inode *inode = page->mapping->host;
> > int ret = 0, ret2;
> >
> > - ret = walk_page_buffers(handle, page_buffers(page),
> > - from, to, NULL, ext3_journal_dirty_data);
> > + if (to - from > 0) {
> > + ret = walk_page_buffers(handle, page_buffers(page),
> > + from, to, NULL, ext3_journal_dirty_data);
> I think this perhaps illustrates my worry. If we don't add all the page
> buffers to the transaction which cover a hole that was filled in
> ->prepare_write(), we'll commit at the bottom of ext3_ordered_commit_write()
> without having covered the entire range which we allocated for.
>
>
> What probably needs to happen is one of two things:
>
> a) The file system rolls back the allocation
>
> b) We somehow write zero's to the part of the region skipped before
> ->commit_write() returns.

OK thanks for looking at that. If the length of the commit is greater
than 0 (but still short), then the page is uptodate so it should be
fine to commit what we have written, I think?

If the length is zero, then we probably want to roll back entirely.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/