[Patch] fix for ext2fs data corrupter (present in all kernels ever)

Stephen C. Tweedie (sct@redhat.com)
Tue, 1 Jun 1999 19:37:41 +0100 (BST)


Hi,

We have a problem in ext2_file_write. Our handling of buffers which are
not already in the buffer cache is fundamentally broken.

The code basically does a

bh = ext2_getblk (inode, block, 1, &err);
c -= copy_from_user (bh->b_data + offset, buf, c);
update_vm_cache(inode, pos, bh->b_data + offset, c);
mark_buffer_uptodate(bh, 1);
mark_buffer_dirty(bh, 0);

and we have two problems. The problem most likely to hit us is a new
one introduced with the page cache, if we collide with brw_page().

brw_page creates temporary buffers for the IO, but if it finds a block
already in the buffer cache, it uses that in preference, waiting for IO
to complete on that block. Because of that IO stall, we can easily end
up executing the code above while the brw_page is still in progress. At
that point, we enter update_vm_cache(), which waits until the page read
is complete.

Unfortunately, at this point in time we have not yet set
buffer_uptodate, so brw_page just submits a new IO to read the data off
disk --- totally overwriting the newly updated contents of the
buffer_head we just wrote. The data write is simply dropped. This
behaviour is exactly what we have observed in a threaded pread/pwrite
test application.

There is also a more subtle bug which cannot simply be fixed just by
moving the mark_buffer_uptodate() to occur before the
update_vm_cache(). If we take a page fault during the copy_from_user,
then another process can do exactly the same thing, overwriting the
partial copy_from_user with old data from disk, because the buffer is
still not uptodate at that point. Unfortunately, we simply cannot mark
the buffer uptodate before the doing the copy_from_user(). This second
problem appears to have been around ever since Linux had a VM.

To fix both problems completely, we need to make the initialisation of a
new buffer from user space obey the same buffer locking synchronisation
as the initialisation of a buffer from disk. The patch below does this
for ext2.

For 2.3, we might want to update buffer.c to expose a new API through
which filesystems can perform that locking cleanly, but that doesn't
actually solve the problem completely: a lot of the work simply _has_ to
be done by the supplier of the new data, since the error recovery after
a partial copy_from_user to such a newly-created buffer has different
requirements from error recovery writing to an already-present buffer.

The patch (against 2.2.7) below fixes the data corruption seen by the
test case application, and should also close the page fault
copy_from_user race (which I have not been able to reproduce on demand).

--Stephen

----------------------------------------------------------------
--- fs/ext2/file.c~ Mon Dec 21 23:22:54 1998
+++ fs/ext2/file.c Tue Jun 1 19:05:02 1999
@@ -162,7 +162,7 @@
struct buffer_head * bh, *bufferlist[NBUF];
struct super_block * sb;
int err;
- int i,buffercount,write_error;
+ int i,buffercount,write_error, new_buffer;

/* POSIX: mtime/ctime may not change for 0 count */
if (!count)
@@ -247,30 +247,59 @@
}
if (c > count)
c = count;
- if (c != sb->s_blocksize && !buffer_uptodate(bh)) {
- ll_rw_block (READ, 1, &bh);
- wait_on_buffer (bh);
- if (!buffer_uptodate(bh)) {
- brelse (bh);
+
+ /* Tricky: what happens if we are writing the complete
+ * contents of a block which is not currently
+ * initialised? We have to obey the same
+ * synchronisation rules as the IO code, to prevent some
+ * other process from stomping on the buffer contents by
+ * refreshing them from disk while we are setting up the
+ * buffer. The copy_from_user() can page fault, after
+ * all. We also have to throw away partially successful
+ * copy_from_users to such buffers, since we can't trust
+ * the rest of the buffer_head in that case. --sct */
+
+ new_buffer = (!buffer_uptodate(bh) && !buffer_locked(bh) &&
+ c == sb->s_blocksize);
+
+ if (new_buffer) {
+ set_bit(BH_Lock, &bh->b_state);
+ c -= copy_from_user (bh->b_data + offset, buf, c);
+ if (c != sb->s_blocksize) {
+ c = 0;
+ unlock_buffer(bh);
+ brelse(bh);
if (!written)
- written = -EIO;
+ written = -EFAULT;
break;
}
+ mark_buffer_uptodate(bh, 1);
+ unlock_buffer(bh);
+ } else {
+ if (!buffer_uptodate(bh)) {
+ ll_rw_block (READ, 1, &bh);
+ wait_on_buffer (bh);
+ if (!buffer_uptodate(bh)) {
+ brelse (bh);
+ if (!written)
+ written = -EIO;
+ break;
+ }
+ }
+ c -= copy_from_user (bh->b_data + offset, buf, c);
}
- c -= copy_from_user (bh->b_data + offset, buf, c);
if (!c) {
brelse(bh);
if (!written)
written = -EFAULT;
break;
}
+ mark_buffer_dirty(bh, 0);
update_vm_cache(inode, pos, bh->b_data + offset, c);
pos += c;
written += c;
buf += c;
count -= c;
- mark_buffer_uptodate(bh, 1);
- mark_buffer_dirty(bh, 0);

if (filp->f_flags & O_SYNC)
bufferlist[buffercount++] = bh;

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/