Re: [PATCH] jbd2: avoid mount failed when commit block is partial submitted

From: yebin (H)
Date: Sat Apr 06 2024 - 21:37:49 EST




On 2024/4/3 18:11, Jan Kara wrote:
On Tue 02-04-24 23:37:42, Theodore Ts'o wrote:
On Tue, Apr 02, 2024 at 03:42:40PM +0200, Jan Kara wrote:
On Tue 02-04-24 17:09:51, Ye Bin wrote:
We encountered a problem that the file system could not be mounted in
the power-off scenario. The analysis of the file system mirror shows that
only part of the data is written to the last commit block.
To solve above issue, if commit block checksum is incorrect, check the next
block if has valid magic and transaction ID. If next block hasn't valid
magic or transaction ID then just drop the last transaction ignore checksum
error. Theoretically, the transaction ID maybe occur loopback, which may cause
the mounting failure.

Signed-off-by: Ye Bin <yebin10@xxxxxxxxxx>
So this is curious. The commit block data is fully within one sector and
the expectation of the journaling is that either full sector or nothing is
written. So what kind of storage were you using that it breaks these
expectations?
I suppose if the physical sector size is 512 bytes, and the file
system block is 4k, I suppose it's possible that on a crash, that part
of the 4k commit block could be written.
I was thinking about that as well but the commit block looks like:

truct commit_header {
__be32 h_magic;
__be32 h_blocktype;
__be32 h_sequence;
unsigned char h_chksum_type;
unsigned char h_chksum_size;
unsigned char h_padding[2];
__be32 h_chksum[JBD2_CHECKSUM_BYTES];
__be64 h_commit_sec;
__be32 h_commit_nsec;
};

where JBD2_CHECKSUM_BYTES is 8. So all the data in the commit block
including the checksum is in the first 60 bytes. Hence I would be really
surprised if some storage can tear that...
This issue has been encountered a few times in the context of eMMC devices. The vendor
has confirmed that only 512-byte atomicity can be ensured in the firmware.
Although the valid data is only 60 bytes, the entire commit block is used for calculating
the checksum.
jbd2_commit_block_csum_verify:
..
calculated = jbd2_chksum(j, j->j_csum_seed, buf, j->j_blocksize);
..

Hence either Ye Bin is running on some really exotic storage or the storage
/ CPU in fact flipped bits somewhere so that the checksum didn't match or
the commit block was in fact not written now but it was a leftover from
previous journal use and h_sequence happened to match. Very unlikely but
depending on how exactly they do their powerfail testing I can imagine it
would be possible with enough tries...

In *practice* though, this
is super rare. That's because on many modern HDD's, the physical
sector size is 4k (because the ECC overhead is much lower), even if
the logical sector size is 512 byte (for Windows 98 compatibility).
And even on HDD's where the physical sector size is really 512 bytes,
the way the sectors are laid out in a serpentine fashion, it is
*highly* likely that 4k write won't get torn.

And while this is *possible*, it's also possible that some kind of I/O
transfer error --- such as some bit flips which breaks the checksum on
the commit block, but also trashes the tid of the subsequent block,
such that your patch gets tricked into thinking that this is the
partial last commit, when in fact it's not the last commit, thus
causing the journal replay abort early. If that's case, it's much
safer to force fsck to be run to detect any inconsistency that might
result.
Yeah, I agree in these cases of a corrupted journal it seems dangerous to
just try to continue without fsck based on some heuristics.

Honza