Re: [RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears

From: Matthew Wilcox

Date: Mon Apr 27 2026 - 09:51:37 EST

On Mon, Apr 27, 2026 at 01:24:45PM +0200, Jan Kara wrote:
> On Sun 26-04-26 03:42:38, Matthew Wilcox wrote:
> > Why are we calling clear_buffer_uptodate() in end_buffer_async_read()?
> > If the buffer is uptodate, we shouldn't be reading into it. If it's
> > not uptodate, we don't need to clear the uptodate flag because it's
> > already clear.
> >
> > I've been deleting calls to ClearPageUptodate and folio_clear_uptodate()
> > from filesystems; it's almost always the wrong thing to do. But the
> > buffer cache does have slightly different rules from the page cache,
> > so this may not translate well.
>
> Right, I think the clear_buffer_uptodate() call in end_buffer_async_read()
> is indeed superfluous and we can just remove it - possibly replace with
> WARN_ON_ONCE(buffer_uptodate(bh)) - to deal with this race.
>
> That being said block_commit_write() could possibly race with
> end_buffer_write_sync() as well and there the clear_buffer_uptodate()
> should stay (well, unless we want to do a serious rework of IO error bh
> handling). So the changes to block_commit_write() might still be needed.

Well, I don't want to do a serious rework of io error bh handling, but
I do have questions! Let's say we take this patch, so we do:

lock_buffer(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
unlock_buffer(bh);

That means we don't hit the warning, but we do hit an error in
end_buffer_write_sync() anyway:

mark_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
unlock_buffer(bh);

We now have a buffer which is marked as write_io_error, dirty and
!uptodate. That's a logically incoherent state -- dirty means "is
newer than what is on disc" and !uptodate means "on disc is newer
than this buffer". I don't know if we have any assertions that
check for this, but we probably should?

So I think what we need here is (in addition to this patch):

+++ b/fs/buffer.c
@@ -169,7 +169,8 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
} else {
buffer_io_error(bh, ", lost sync page write");
mark_buffer_write_io_error(bh);
- clear_buffer_uptodate(bh);
+ if (!buffer_dirty(bh)
+ clear_buffer_uptodate(bh);
}
unlock_buffer(bh);
put_bh(bh);

This makes sense to me since we need to retry the write anyway.

And of course, now I'm looking at buffer.c, I found another whoopsie.
Remember 7375f22495e7 where we called put_bh() on a stack buffer
after unlocking it? Don't we have the same problem in
end_buffer_write_sync()? I can't find anyone calling it with a
BH that's on the stack, but it's a bit of a nasty footgun.