Re: [PATCH] xfs: snapshot current CIL sequence under xc_push_lock

From: Cen Zhang

Date: Thu May 07 2026 - 00:34:47 EST

Dear Chinner

>
> Which does not matter at all.
>
> Integrity in XFS is defined by completion-to-submission semantics.
> IOWs, the only thing that a CIL -force- guarantees is that it will
> ensure the current CIL context containing fully completed
> transactions when it is called.
>
> See, for example, __xfs_trans_commit() -> xfs_log_force() for
> XFS_TRANS_SYNC transactions.
>
> So if xlog_cil_force() races with a push switching to a new sequence
> number, it means that all the transactions that were completed at
> the time xfs_log_force() was called are -already being pushed-.
> Hence all we need to do is wait for the CIL context at that sequence
> to be committed. The new sequence number, by definition, contains
> modifications that completed -after- the log force was started, and
> so are outside the scope of the current force operation.
>
> IOWs, we can safely skip the new sequence number in teh case of a
> push/force race, and the force will still correctly wait on the old
> sequence number being forced to disk. The race is benign, and not a
> bug.
>
> As for xlog_cil_flush(), it is a latency reduction mechanism and not
> a integrity mechanism. We just don't care about racing, because the
> caller will call it again if sufficient progress wasn't made by the
> previous call.
>
> Hence I don't think there is a bug that needs fixing here. Yes, the
> fact that deducing it is ok requires understanding
> competion-to-submission integrity semantics, but that's kinda
> assumed knowledge because it's exactly the same semantics as we
> relyon for data/metadata ordering and integrity w/ O_SYNC/O_DSYNC
> operations...
>
>
> So the change is likely correct, but somewhat nasty in places.
> Regardless, I'm inclined to reject it based on "if it isn't broken,
> don't fix it" principles.
>
> In future, when reporting problems and fixes like this, it is very
> important that you say how the issue was found, whether it can be
> reproduced, hardware configs, etc. Presenting multiple fixes in a
> short period for a wide variety of different conditions in extremely
> complex code without any context of how/when they occur or what
> impact they have on systems smells like AI slop more than
> anything....
>
> So, please document what tools/AI are you using to find these
> problems and how they can be reproduced. I haven't commented on the
> other two "fixes" you've just posted because I haven't got hours to
> dig into them right now. You also posted a fix a month ago for a
> race condition that wasn't a race condition in buf log item
> handling, so something is clearly directing you at them....
>

Thanks for the detailed explanation.

You are right. My analysis of the XFS CIL checkpoint semantics was not
careful enough, and I overstated the possible effect of this window.

The report originally came from dynamic testing with a customized
syzkaller setup, which observed an unsynchronised access pattern at
runtime. I then tried to analyse whether that pattern could lead to a
real correctness problem, with some AI assistance during the reasoning
process. However, I did not validate that analysis sufficiently against
the actual XFS CIL semantics, and that led to an incorrect conclusion on
my side.

I am sorry for the noise and for taking your time on this.

In the past, when I sent more direct/raw reports from my testing, some of
them were reasonably treated as bot-like reports. Since then I have been
trying to better understand the reports before sending them, so that I
can submit more useful and actionable issues or patches. Clearly I still
got this wrong here.

I will take your comments seriously and be more careful in future
reports, especially when reasoning about subsystem-specific semantics
such as XFS log/CIL behavior.

Thanks again for the explanation, and sorry again for the noise.

Best regards
Cen