Re: [RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears

From: Jan Kara

Date: Mon Apr 27 2026 - 12:07:18 EST

On Mon 27-04-26 14:46:41, Matthew Wilcox wrote:
> On Mon, Apr 27, 2026 at 01:24:45PM +0200, Jan Kara wrote:
> > On Sun 26-04-26 03:42:38, Matthew Wilcox wrote:
> > > Why are we calling clear_buffer_uptodate() in end_buffer_async_read()?
> > > If the buffer is uptodate, we shouldn't be reading into it. If it's
> > > not uptodate, we don't need to clear the uptodate flag because it's
> > > already clear.
> > >
> > > I've been deleting calls to ClearPageUptodate and folio_clear_uptodate()
> > > from filesystems; it's almost always the wrong thing to do. But the
> > > buffer cache does have slightly different rules from the page cache,
> > > so this may not translate well.
> >
> > Right, I think the clear_buffer_uptodate() call in end_buffer_async_read()
> > is indeed superfluous and we can just remove it - possibly replace with
> > WARN_ON_ONCE(buffer_uptodate(bh)) - to deal with this race.
> >
> > That being said block_commit_write() could possibly race with
> > end_buffer_write_sync() as well and there the clear_buffer_uptodate()
> > should stay (well, unless we want to do a serious rework of IO error bh
> > handling). So the changes to block_commit_write() might still be needed.
>
> Well, I don't want to do a serious rework of io error bh handling, but
> I do have questions! Let's say we take this patch, so we do:
>
> lock_buffer(bh);
> set_buffer_uptodate(bh);
> mark_buffer_dirty(bh);
> unlock_buffer(bh);
>
> That means we don't hit the warning, but we do hit an error in
> end_buffer_write_sync() anyway:
>
> mark_buffer_write_io_error(bh);
> clear_buffer_uptodate(bh);
> unlock_buffer(bh);
>
> We now have a buffer which is marked as write_io_error, dirty and
> !uptodate. That's a logically incoherent state -- dirty means "is
> newer than what is on disc" and !uptodate means "on disc is newer
> than this buffer". I don't know if we have any assertions that
> check for this, but we probably should?

So I tend to agree with you that clearing buffer_uptodate bit on write IO
error (as much as it goes back many many years) is wrong. When I tried to
fix that some 20 years ago, Linus was against it arguing that uptodate
means "it matches the on-disk content unless dirty is set" so we just
worked-around it in ext3 / ext4. Fixing this was the "serious rework" I was
alluding to but these days we probably have less code depending on uptodate
bit getting cleared on write IO error than back then. So maybe it will be
mostly smooth.

> So I think what we need here is (in addition to this patch):
>
> +++ b/fs/buffer.c
> @@ -169,7 +169,8 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
> } else {
> buffer_io_error(bh, ", lost sync page write");
> mark_buffer_write_io_error(bh);
> - clear_buffer_uptodate(bh);
> + if (!buffer_dirty(bh)
> + clear_buffer_uptodate(bh);

Well, if we touch this code, I'd just remove the clearing altogether.

> }
> unlock_buffer(bh);
> put_bh(bh);
>
> This makes sense to me since we need to retry the write anyway.
>
> And of course, now I'm looking at buffer.c, I found another whoopsie.
> Remember 7375f22495e7 where we called put_bh() on a stack buffer
> after unlocking it? Don't we have the same problem in
> end_buffer_write_sync()? I can't find anyone calling it with a
> BH that's on the stack, but it's a bit of a nasty footgun.

Yes, I'm not aware of any such place either. Culprit of this is that struct
buffer_head is used both as a structure attached to a folio caching mapping
information (perfectly fine to submit for IO) and structure to pass around
mapping information which gets sometimes also abused to submit IO and then
there are issues. We are getting rid of this second use but it's slow.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR