Re: Linux 2.4.30-rc3 md/ext3 problems (ext3 gurus : please check)

From: Neil Brown
Date: Tue Mar 29 2005 - 23:09:09 EST


On Tuesday March 29, marcelo.tosatti@xxxxxxxxxxxx wrote:
>
> Attached is the backout patch, for convenience.

Thanks. I had another look, and think I may be able to see the
problem. If I'm right, it is a problem with this patch.

> diff -Nru a/fs/jbd/commit.c b/fs/jbd/commit.c
> --- a/fs/jbd/commit.c 2005-03-29 18:50:55 -03:00
> +++ b/fs/jbd/commit.c 2005-03-29 18:50:55 -03:00
> @@ -92,7 +92,7 @@
> struct buffer_head *wbuf[64];
> int bufs;
> int flags;
> - int err = 0;
> + int err;
> unsigned long blocknr;
> char *tagp = NULL;
> journal_header_t *header;
> @@ -299,8 +299,6 @@
> spin_unlock(&journal_datalist_lock);
> unlock_journal(journal);
> wait_on_buffer(bh);
> - if (unlikely(!buffer_uptodate(bh)))
> - err = -EIO;
> /* the journal_head may have been removed now */
> lock_journal(journal);
> goto write_out_data;


I think the "!buffer_update(bh)" test is not safe at this point as,
after the wait_on_buffer which could cause a schedule, the bh may
no longer exist, or be for the same block. There doesn't seem to be
any locking or refcounting that would keep it valid.

Note the comment "the journal_head may have been removed now".
If the journal_head is gone, the associated buffer_head is likely gone
as well.

I'm not certain that this is right, but it seems possible and would
explain the symptoms. Maybe Stephen or Andrew could comments?


> --- a/mm/filemap.c 2005-03-29 18:50:55 -03:00
> +++ b/mm/filemap.c 2005-03-29 18:50:55 -03:00
> @@ -3261,12 +3261,7 @@
> status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
> }
>
> - /*
> - * generic_osync_inode always returns 0 or negative value.
> - * So 'status < written' is always true, and written should
> - * be returned if status >= 0.
> - */
> - err = (status < 0) ? status : written;
> + err = written ? written : status;
> out:
>
> return err;

As an aside, this looks extremely dubious to me.

There is a loop earlier in this routine (do_generic_file_write) that
passes a piece-at-a-time of the write request to prepare_write /
commit_write.
Successes are counted in 'written'. A failure causes the loop to
abort with a status in 'status'.

If some of the write succeeded and some failed, then I believe the
correct behaviour is to return the number of bytes that succeeded.
However this change to the return status (remember the above patch is
a reversal) causes any failure to over-ride any success. This, I
think, is wrong.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/