Re: [PATCH] xfs: annotate lockless bli_flags access in buf item paths
From: Cen Zhang
Date: Sun Mar 29 2026 - 22:51:28 EST
Hi Dave,
> When XFS_BLI_STALE is set in unpin, then by definition the buffer
> must be locked and the BLI referenced. [...]
> Hence we only care about the bli_flags when unpin has the only
> remaining reference to the bli, and in that case the read cannot be
> racing with anything else changing it's state.
Thank you for the detailed walkthrough. You're right about several
things: READ_ONCE()/WRITE_ONCE() was the wrong tool here, the
WRITE_ONCE in xfs_buf_item_release() is useless since it's under
the buffer lock, and the tracepoint change doesn't matter. I'm
dropping those.
However, I want to share the actual concurrent access evidence
before we decide on the remaining two reads:
(1) xfs_buf_item_unpin:510 reads bli_flags
(2) xfs_buf_item_committed:803 reads bli_flags
Both race with two different write paths:
Write A: xfs_trans_dirty_buf:532 (bli_flags |= DIRTY|LOGGED)
via new transaction on the same buffer
Write B: xfs_buf_item_release:713 (bli_flags &= ~(LOGGED|HOLD|ORDERED))
via iop_committing on xlog_cil_commit path
The concurrent paths are:
CPU 0: old checkpoint completing on workqueue
→ xlog_cil_committed → iop_committed / iop_unpin
CPU 1: new transaction with the same buffer committing
→ xlog_cil_commit → iop_committing → iop_release
(or xfs_trans_dirty_buf during the transaction)
The comment at xlog_cil.c:1856 even acknowledges this:
"the CIL checkpoint can race with us and we can run checkpoint
completion before we've updated and unlocked the log items"
> IOWs, using READ_ONCE to indicate a data access race is -incorrect-
> in the cases where we actually use the result of the read.
Agreed. For xfs_buf_item_unpin(), you're right that stale is only
consumed on the last-reference path, where no concurrent writer can
exist. But the read at line 510 happens unconditionally _before_
atomic_dec_and_test(), so on the non-last-reference path it does
race with the writes listed above. The value is discarded in that
case, so there is no semantic issue.
Rather than annotating a race whose result is unused, a cleaner fix
is to move the stale read to after the freed check, so the read
only occurs when we actually need the value -- at which point there
is no race:
 	freed = atomic_dec_and_test(&bip->bli_refcount);
 	...
 	if (!freed) {
 		xfs_buf_rele(bp);
 		return;
 	}
+	stale = bip->bli_flags & XFS_BLI_STALE;
This eliminates the concurrent access entirely with no behavior
change.
> There is no consequential race here, either. Once an BLI is marked
> as an XFS_BLI_INODE_ALLOC_BUF under the buffer lock [...] that
> flag never gets cleared from bip->bli_flags.
Correct -- the bit being read (INODE_ALLOC_BUF) is never modified
by the concurrent writers, so the value observed is always valid.
But the concurrent writes to other bits in the same word (Write A
and Write B above) do constitute a data race on that memory
location. Since the race is confirmed benign, data_race() is the
right annotation.
- if ((bip->bli_flags & XFS_BLI_INODE_ALLOC_BUF) && ...)
+ if ((data_race(bip->bli_flags) & XFS_BLI_INODE_ALLOC_BUF) && ...)
So the v2 would be:
- Move stale read in xfs_buf_item_unpin() after freed check
- data_race() in xfs_buf_item_committed()
- Drop everything else (WRITE_ONCE, tracepoint, release-side)
Does this sound reasonable?
Thanks,
Cen