[RFC] commit_write less than prepared by prepare_write

From: Nick Piggin
Date: Sun Oct 22 2006 - 04:40:47 EST


This is a request for comments and review by filesystem maintainers.
Especially if you own GFS2, OCFS2, Reiserfs, JFFS! Those using fs/buffer.c
routines directly should be reasonably safe.

We have several long standing deadlocks in the core MM when doing
buffered writes. The full explanation is in 2.6.19-rc2-mm2 in the
patch mm-fix-pagecache-write-deadlocks.patch and there are some followup
fixes.

Basically: when copying the pages from user supplied address to pagecache,
between prepare_write and commit_write, we can take a fault on the
copy_from_user and enter the pagefault handler holding the page lock.

There is no way that this can work safely, so we must do atomic copies
here, which will just return a short write in the case of a fault.

The protocol for handling short writes is as follows:

- If page uptodate, commit_write with the shorter length (potentially 0).
Move the counters forward and retry.

- If the page is not uptodate, we commit a zero length commit_write, at the
start offset given by the prepare_write being closed.

* Note that if we hit some other error condition, we may never run another
prepare_write or commit_write, so we can't rely on that.

For the case of an uptodate page, it may be somewhat safer to commit the
full length that was prepared (we can always pretend that the uncopied
part got dirtied; the retry can happily overwrite it). Comments? I figure
that if we have to audit everything anyway, then keeping this wart would
be silly.

For the non-uptodate page:
We can't run a full length commit_write, because the data in the page is
uninitialised. Can't zero out the uninitialised data, because that would
lead to zeros temporarily appearing in the pagecache. Any other ways to
fix it?

The problem with doing a short commit, as noted by Mark Fasheh, is that
prepare_write might have allocated a disk mapping (eg. if we write over a
hole), and if the commit fails and we unlock the page, then a read from
the filesystem might think that is valid data.

Mark's idea is to loop through the page buffers and zero out the
buffer_new ones when the caller detects a fault here. This should work
because a hole is zeroes anyway, so we could mark the page uptodate and it
would eventually get written back to initialise those blocks. But, is
there any other filesystem specific state that would need to be fixed
up? Do we need another filesystem call?

I would prefer that disk block allocation be performed at commit_write
time. I realise that is probably a lot more rework, and there may be
reasons why it can't be done?

Another alternative is to pre-read the page in prepare_write, which
should always get us out of trouble. Could be a good option for
simple filesystems. People who care about performance probably want to
avoid this.

Any other ideas?

The regular buffered filesystems seem to be working OK. The bulk of the
filesystems are in fs-prepare_write-fixes.patch, but it is a bit
scattered at the moment. Applying the full -mm patch would be a good
idea.

The conversion involves:
* ensure we don't mark the page uptodate in prepare_write if it is not
(which is a bug anyway... I found a few of those).
* ensure we don't mark a non uptodate page as uptodate if the commit
was short.

Filesystems that are non trivial are GFS2, OCFS2, Reiserfs, JFFS so
I need maintainers to look at those.

For the "moronic filesystems that do not allow holes", the cont_ routines
have already been using zero length prepare/commit for something else.
OGAWA Hirofumi is thinking about this. Any comments?

Thanks,
Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/