Re: [PATCH] jbd2: avoid mount failed when commit block is partial submitted

From: Theodore Ts'o
Date: Tue Apr 02 2024 - 23:38:16 EST


On Tue, Apr 02, 2024 at 03:42:40PM +0200, Jan Kara wrote:
> On Tue 02-04-24 17:09:51, Ye Bin wrote:
> > We encountered a problem where the file system could not be mounted
> > after a power-off. Analysis of the file system image shows that
> > only part of the data was written to the last commit block.
> > To solve this, if the commit block checksum is incorrect, check whether
> > the next block has a valid magic and transaction ID. If it does not,
> > just drop the last transaction and ignore the checksum error.
> > Theoretically, the transaction ID may wrap around, which may still
> > cause the mount to fail.
> >
> > Signed-off-by: Ye Bin <yebin10@xxxxxxxxxx>
>
> So this is curious. The commit block data is fully within one sector and
> the expectation of the journaling is that either full sector or nothing is
> written. So what kind of storage were you using that it breaks these
> expectations?

If the physical sector size is 512 bytes and the file system block
is 4k, I suppose it's possible that on a crash only part of the 4k
commit block gets written. In *practice* though, this is super
rare. That's because on many modern HDDs, the physical sector size
is 4k (because the ECC overhead is much lower), even if the logical
sector size is 512 bytes (for Windows 98 compatibility). And even on
HDDs where the physical sector size is really 512 bytes, the way the
sectors are laid out in a serpentine fashion makes it *highly* likely
that a 4k write won't get torn.
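
To make the sector arithmetic concrete, here is a minimal sketch in
C. The function names are made up for illustration, and it assumes
the device writes each physical sector atomically. The sysfs
attribute mentioned in the comment (queue/physical_block_size) is
where Linux reports the physical sector size; "sda" is only an
example device name.

```c
#include <stdbool.h>
#include <stdio.h>

/* Number of physical sectors that one file system block spans.
 * Returns 0 if the sizes are invalid, do not divide evenly, or the
 * block is smaller than a physical sector. */
unsigned int sectors_per_block(unsigned int fs_block_size,
                               unsigned int phys_sector_size)
{
    if (phys_sector_size == 0 || fs_block_size % phys_sector_size != 0)
        return 0;
    return fs_block_size / phys_sector_size;
}

/* Assumption: each physical sector is written atomically, so a block
 * write can only be torn if it spans more than one physical sector. */
bool block_write_can_tear(unsigned int fs_block_size,
                          unsigned int phys_sector_size)
{
    return sectors_per_block(fs_block_size, phys_sector_size) > 1;
}

/* Read one unsigned value from a sysfs attribute, for example
 * /sys/block/sda/queue/physical_block_size.
 * Returns 0 if the file cannot be opened or parsed. */
unsigned int read_sysfs_uint(const char *path)
{
    unsigned int value = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return 0;
    if (fscanf(f, "%u", &value) != 1)
        value = 0;
    fclose(f);
    return value;
}
```

With a 4k block on 512-byte physical sectors the write spans 8
sectors and could in principle be torn; with 4k physical sectors it
spans exactly one and cannot, under the atomicity assumption above.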

And while a torn write is *possible*, it's also possible that some
kind of I/O transfer error (say, some bit flips) breaks the checksum
on the commit block and also trashes the tid of the subsequent block,
such that your patch gets tricked into thinking that this is the
partial last commit when in fact it's not the last commit, thus
causing the journal replay to abort early. If that's the case, it's
much safer to force fsck to be run to detect any inconsistency that
might result.
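
A rough sketch of that failure mode, in C. This is not the code from
the patch: the struct, the enum and classify_bad_commit() are
hypothetical simplifications of the heuristic as the patch
description states it, and only the jbd2 magic value is taken from
the real on-disk format.

```c
#include <stdbool.h>
#include <stdint.h>

#define JBD2_MAGIC_NUMBER 0xC03B3998U /* magic at the start of jbd2 blocks */

/* Hypothetical, simplified view of the block that follows the commit
 * block whose checksum failed. Fields are already in CPU byte order. */
struct next_block_hdr {
    uint32_t magic;
    uint32_t tid;
};

enum bad_commit_verdict {
    BAD_COMMIT_IS_LOG_END,    /* drop last transaction, ignore checksum */
    BAD_COMMIT_IS_CORRUPTION  /* a later transaction exists: fail replay */
};

/* Heuristic as described in the patch: a valid magic plus the expected
 * tid in the next block means the log continues, so the bad checksum
 * is real corruption. Anything else is treated as the end of the log. */
enum bad_commit_verdict
classify_bad_commit(const struct next_block_hdr *next, uint32_t expected_tid)
{
    bool next_is_valid = next->magic == JBD2_MAGIC_NUMBER &&
                         next->tid == expected_tid;

    return next_is_valid ? BAD_COMMIT_IS_CORRUPTION : BAD_COMMIT_IS_LOG_END;
}
```

If the same I/O error that broke the commit checksum also flips a
bit in the next block's tid, next_is_valid becomes false and the
function reports the end of the log, even though later transactions
are present and will never be replayed.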

In general, I strongly recommend that fsck be run on the file system
before you try to mount it. Yeah, historically the root file system
gets mounted read-only, and then fsck gets run on it, and if
necessary, fsck will fix it up and then force a reboot. Ye, I'm
assuming that this is what you're doing, and so that's why you really
don't want the mount to fail?

If so, the better way to address this is to use an initramfs which can
run fsck on the real root file system, then mount it, then use
pivot_root and exec the real init program. That way, even if the
journal is corrupted in that way, fsck will attempt to replay the
journal, fail, and you can have fsck do a forced fsck to fix up the
file system.
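
As an illustration of the decision an initramfs init has to make
after running fsck, here is a small sketch in C. The bit values are
the exit status bits documented in fsck(8); the macro names, the
enum and boot_action_for_fsck() are made up for this example, and
the policy (mount, reboot, or drop to a rescue shell) is only one
reasonable choice.

```c
/* Exit status bits from fsck(8); the macro names are local to this
 * sketch. */
#define FSCK_ERRORS_CORRECTED   1   /* file system errors corrected */
#define FSCK_REBOOT_NEEDED      2   /* system should be rebooted */
#define FSCK_ERRORS_UNCORRECTED 4   /* errors left uncorrected */
#define FSCK_OPERATIONAL_ERROR  8   /* operational error */

enum boot_action {
    BOOT_MOUNT_ROOT,  /* safe to mount the real root and exec init */
    BOOT_REBOOT,      /* fsck asked for a reboot */
    BOOT_RESCUE       /* needs manual repair or a forced fsck */
};

/* Map an fsck exit status (as obtained from WEXITSTATUS() after
 * waitpid()) to what the initramfs should do next. A negative status
 * means fsck could not be run at all. */
enum boot_action boot_action_for_fsck(int status)
{
    /* Any bit other than "corrected" or "reboot" is treated as fatal. */
    if (status < 0 ||
        (status & ~(FSCK_ERRORS_CORRECTED | FSCK_REBOOT_NEEDED)))
        return BOOT_RESCUE;
    if (status & FSCK_REBOOT_NEEDED)
        return BOOT_REBOOT;
    return BOOT_MOUNT_ROOT;
}
```

A status of 0 or 1 lets the boot continue to the mount and
pivot_root step, while 4 or above is where a forced check (for
example e2fsck -f) or a rescue shell would come in.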

- Ted