Re: ext4 damage suspected in between 5.15.167 - 5.15.170

From: Nikolai Zhubr
Date: Sat Dec 14 2024 - 14:56:26 EST


Hi Ted,

On 12/13/24 19:12, Theodore Ts'o wrote:
> stable@xxxxxxxxxx" to the commit description. However, they are not
> obligated to do that, so there is an auxiliary system which uses AI to
> intuit which patches might be a bug fix. There are also automated
> systems that try to automatically figure out which patches might be

Oh, so it has meanwhile become even worse than I had imagined :-) Thanks for pointing that out.

> Note that some hardware errors can be caused by one-off errors, such
> as cosmic rays causing a bit-flip in memory DIMM. If that happens,
> RAID won't save you, since the error was introduced before an updated

Certainly cosmic rays are a possibility, but based on previous episodes I'd still rather bet on a more ordinary "subtle interaction" problem, either the exact same one as in [1] or something similar.
I even tried to run the existing test for this particular case as described in [2], but it is not very user-friendly and somehow exits abnormally without doing any interesting work. I'll get back to it later when I have some time.

[1] https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuhc3@quack3/
[2] https://lwn.net/Articles/954364/

> The location of block allocation bitmaps never gets changed, so this
> sort of thing only happens due to hardware-induced corruption.

Well, unless e.g. some modified sectors start being flushed to random wrong offsets, like in [1] above, or something similar.

> Looking at the dumpe2fs output, it looks like it was created
> relatively recently (July 2024) but it doesn't have the metadata
> checksum feature enabled, which has been enabled for quite a long

Yes. That was intentional, for better compatibility with even more ancient stuff. Maybe the time has come to reconsider that approach, though.
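
For reference, a minimal sketch of how one might check whether a filesystem has the metadata checksum feature, by scanning the header text printed by `dumpe2fs -h` for the "Filesystem features:" line. The helper name `has_feature` and the sample text are assumptions made up for illustration; the sample is not the output from the affected machine. As far as I know, the feature can be turned on for an existing, unmounted filesystem with `tune2fs -O metadata_csum`, but check the man page for your e2fsprogs version first.

```python
def has_feature(dumpe2fs_header: str, feature: str) -> bool:
    """Return True if `feature` is listed on the 'Filesystem features:' line."""
    for line in dumpe2fs_header.splitlines():
        if line.startswith("Filesystem features:"):
            # Everything after the first colon is a whitespace-separated list.
            return feature in line.split(":", 1)[1].split()
    return False


# Illustrative header text only (hypothetical, shortened).
SAMPLE = """\
Filesystem volume name:   <none>
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file dir_nlink extra_isize
"""

if __name__ == "__main__":
    print("metadata_csum enabled:", has_feature(SAMPLE, "metadata_csum"))
```

In practice the text would come from running `dumpe2fs -h` on the device (as root) and capturing its standard output.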

> You got lucky because the block allocation bitmap location was
> corrupted to an obviously invalid value. But if it had been a

Absolutely. I was really amazed when I realized that :-)
It saved me days or even weeks of unnecessary verification work.
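
To illustrate why this was luck, here is a toy sketch comparing a plain range check with a checksum on a corrupted block number. Everything in it is an assumption for illustration: the 8-byte "descriptor" layout, the `BLOCKS_COUNT` value and the helper names are hypothetical, and `zlib.crc32` is only a stand-in (ext4's metadata checksums use crc32c and a different on-disk layout).

```python
import struct
import zlib

BLOCKS_COUNT = 1_000_000  # hypothetical filesystem size, in blocks


def pack_descriptor(bitmap_block: int) -> bytes:
    """Toy descriptor: 32-bit block number followed by a CRC32 of it."""
    body = struct.pack("<I", bitmap_block)
    return body + struct.pack("<I", zlib.crc32(body))


def range_check(raw: bytes) -> bool:
    """Passes if the stored block number lies inside the filesystem."""
    (bitmap_block,) = struct.unpack_from("<I", raw, 0)
    return bitmap_block < BLOCKS_COUNT


def checksum_ok(raw: bytes) -> bool:
    """Passes if the stored CRC32 matches the block number bytes."""
    (stored,) = struct.unpack_from("<I", raw, 4)
    return zlib.crc32(raw[:4]) == stored


def flip_bit(raw: bytes, bit: int) -> bytes:
    """Return a copy of `raw` with one bit inverted."""
    out = bytearray(raw)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


good = pack_descriptor(1025)
high = flip_bit(good, 31)  # block number becomes huge: obviously invalid
low = flip_bit(good, 3)    # block number becomes 1033: looks plausible

if __name__ == "__main__":
    for name, raw in (("good", good), ("high flip", high), ("low flip", low)):
        print(name, "range:", range_check(raw), "checksum:", checksum_ok(raw))
```

The high-bit flip is caught by both checks, while the low-bit flip still points inside the filesystem and is caught only by the checksum.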

> Otherwise, I strongly encourage you to learn, and to take
> responsibility for the health of your own system. And ideally, you
> can also use that knowledge to help other users out, which is the only
> way the free-as-in-beer ecosystem can flourish; by having everybody

True. Generally I try to follow that, as much as appears possible.
It is sad that direct end-user-to-developer communication for solving issues is becoming increasingly problematic here.
Anyway, thank you for the friendly words, useful hints and good references!

Regards,

Nick

> helping each other. Who knows, maybe you could even get a job doing
> it for a living. :-) :-) :-)
>
> Cheers,