Re: [PATCH] ext3: prevent reread after write IO error v2

From: Jan Kara
Date: Thu Jan 14 2010 - 09:18:15 EST


On Thu 14-01-10 19:14:30, Hidehiro Kawai wrote:
> This patch fixes the similar bug fixed by commit 95450f5a.
>
> If a directory is modified, its data block is journaled as metadata
> and finally written back to the right place. Now, we assume a
> transient write erorr happens on that writeback. Uptodate flag of
> the buffer is cleared by write error, so next access on the buffer
> causes a reread from disk. This means it breaks the filesystems
> consistency.
>
> To prevent old directory data from being reread, this patch set
> uptodate flag again in the case of after write error before issuing
> the read operation. The write error on the directory's data block
> is detected at the time of journal checkpointing or discarded if a
> rewrite by another modification succeeds, so no problem.
>
> Similarly, this kind of consistency breakage can be caused by
> a transient write error on a bitmap block.
Good catch, that's indeed a problem.

> I tested this patch by using fault injection approach.
>
> By the way, I think the right fix is to keep uptodate flag on write
> error, but it gives a big impact. We have to confirm whether
> over 200 buffer_uptodate's are used for real uptodate check or write
> error check. For now, I adopt the quick-fix solution.
Yes that needs to be solved. I also looked into it and it's too much work
to do it in a one big sweep. But I think we could do the conversion
filesystem by filesystem - see below.
Admittedly, I don't like your solution very much. It looks strange to
check write_io_error when *reading* the buffer and I'm afraid we could
introduce bugs e.g. by clearing write_io_error bit so that ext3_bread would
then fail to detect the error condition or by some other code deciding to
read the buffer from disk via other function than just ext3_bread. So I
think it would be much better to set back uptodate flag shortly after the
failed write or not clear it at all.
Now here's how I think we could achieve that without having to change all
other filesystems: We could create a new superblock flag which would mean
that "this filesystem handles write_io_error and doesn't want to clear
uptodate flag". Filesystems with this capability would set this flag when
calling get_sb_bdev. And if write IO fails we check this flag
(via bh->b_bdev->bd_super->s_flags) and clear / not clear uptodate flag
accordingly. What do you think?
I know it's more work than your quick fix but it should fix all these
problems for ext3 once for all and it would be much cleaner...

Honza
>
> Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@xxxxxxxxxxx>
> ---
> fs/ext3/balloc.c | 12 ++++++++++++
> fs/ext3/inode.c | 13 +++++++++++++
> fs/ext3/namei.c | 15 ++++++++++++++-
> 3 files changed, 39 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
> index 27967f9..5dc5ccf 100644
> --- a/fs/ext3/balloc.c
> +++ b/fs/ext3/balloc.c
> @@ -156,6 +156,18 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group)
> if (likely(bh_uptodate_or_lock(bh)))
> return bh;
>
> + /*
> + * uptodate flag may have been cleared by a previous (transient)
> + * write IO error. In this case, we don't want to reread its
> + * old on-disk data. Actually the buffer has the latest data,
> + * so set uptodate flag again.
> + */
> + if (buffer_write_io_error(bh)) {
> + set_buffer_uptodate(bh);
> + unlock_buffer(bh);
> + return bh;
> + }
> +
> if (bh_submit_read(bh) < 0) {
> brelse(bh);
> ext3_error(sb, __func__,
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 455e6e6..67d7849 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -1077,10 +1077,23 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
> return bh;
> if (buffer_uptodate(bh))
> return bh;
> +
> + /*
> + * uptodate flag may have been cleared by a previous (transient)
> + * write IO error. In this case, we don't want to reread its
> + * old on-disk data. Actually the buffer has the latest data,
> + * so set uptodate flag again.
> + */
> + if (buffer_write_io_error(bh)) {
> + set_buffer_uptodate(bh);
> + return bh;
> + }
> +
> ll_rw_block(READ_META, 1, &bh);
> wait_on_buffer(bh);
> if (buffer_uptodate(bh))
> return bh;
> +
> put_bh(bh);
> *err = -EIO;
> return NULL;
> diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
> index 7b0e44f..7ed8e45 100644
> --- a/fs/ext3/namei.c
> +++ b/fs/ext3/namei.c
> @@ -909,7 +909,20 @@ restart:
> num++;
> bh = ext3_getblk(NULL, dir, b++, 0, &err);
> bh_use[ra_max] = bh;
> - if (bh)
> + if (!bh || buffer_uptodate(bh))
> + continue;
> +
> + /*
> + * uptodate flag may have been cleared by a
> + * previous (transient) write IO error. In
> + * this case, we don't want to reread its
> + * old on-disk data. Actually the buffer
> + * has the latest data, so set uptodate flag
> + * again.
> + */
> + if (buffer_write_io_error(bh))
> + set_buffer_uptodate(bh);
> + else
> ll_rw_block(READ_META, 1, &bh);
> }
> }
> --
> 1.6.0.3
>
>
>
> --
> Hidehiro Kawai
> Hitachi, Systems Development Laboratory
> Linux Technology Center
>
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/