Re: ext4 damage suspected in between 5.15.167 - 5.15.170
From: Nikolai Zhubr
Date: Sat Dec 14 2024 - 14:56:26 EST
Hi Ted,
On 12/13/24 19:12, Theodore Ts'o wrote:
stable@xxxxxxxxxx" to the commit description. However, they are not
obligated to do that, so there is an auxiliary system which uses AI to
intuit which patches might be a bug fix. There are also automated
systems that try to automatically figure out which patches might be
Oh, so meanwhile it got even worse than I had imagined :-) Thanks for
pointing that out.
Note that some hardware errors can be caused by one-off errors, such
as cosmic rays causing a bit-flip in memory DIMM. If that happens,
RAID won't save you, since the error was introduced before an updated
Certainly cosmic rays are a possibility, but based on previous episodes
I'd still rather bet on a more usual "subtle interaction" problem,
either the exact same one or something similar to [1].
I even tried to run an existing test for this particular case as
described in [2] but it is not too user-friendly and somehow exits
abnormally without actually doing any interesting work. I'll get back to
it later when I have some time.
[1] https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuhc3@quack3/
[2] https://lwn.net/Articles/954364/
The location of block allocation bitmaps never gets changed, so this
sort of thing only happens due to hardware-induced corruption.
Well, unless e.g. some modified sectors start being flushed to random
wrong offsets, like in [1] above, or something similar.
Looking at the dumpe2fs output, it looks like it was created
relatively recently (July 2024) but it doesn't have the metadata
checksum feature enabled, which has been enabled for quite a long
Yes. That was intentional, for better compatibility with even more
ancient stuff. Maybe the time has come to reconsider that approach, though.
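
For illustration, a minimal sketch of checking whether a filesystem has the
metadata checksum feature, by reading the "Filesystem features:" line that
`dumpe2fs -h` prints. The helper name `has_feature` and the sample header are
hypothetical; the sample is not the output from the filesystem discussed in
this thread.

```python
def has_feature(dumpe2fs_header: str, feature: str) -> bool:
    """Return True if `feature` is listed on the 'Filesystem features:' line."""
    for line in dumpe2fs_header.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem features":
            # Features are whitespace-separated tokens after the colon.
            return feature in value.split()
    return False


# Hypothetical sample header (assumption, for demonstration only).
SAMPLE = """\
Filesystem volume name:   <none>
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize
Filesystem state:         clean
"""

print(has_feature(SAMPLE, "metadata_csum"))
```

In practice the header text would come from running `dumpe2fs -h` on the
device in question and passing its output to the helper.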
You got lucky because the block allocation bitmap location was
corrupted to an obviously invalid value. But if it had been a
Absolutely. I was really amazed when I realized that :-)
It saved me days or even weeks of unnecessary verification work.
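
To make the "lucky" part concrete, here is a rough sketch of why an obviously
invalid value is easy to catch while a plausible one is not, and why a
checksum helps in both cases. All numbers and helper names are hypothetical,
and `zlib.crc32` is only a stand-in: ext4's metadata checksums use crc32c,
which is a different polynomial.

```python
import zlib


def in_range(block_no: int, total_blocks: int) -> bool:
    """Plain sanity check: is the block number inside the filesystem?"""
    return 0 <= block_no < total_blocks


def flip_bit(value: int, bit: int) -> int:
    """Simulate a single bit-flip in a stored value."""
    return value ^ (1 << bit)


def record_checksum(block_no: int) -> int:
    """Checksum of the stored field (CRC-32 here, as a stand-in for crc32c)."""
    return zlib.crc32(block_no.to_bytes(8, "little"))


TOTAL_BLOCKS = 1_000_000  # hypothetical filesystem size in blocks
GOOD = 1_025              # hypothetical bitmap location
stored_crc = record_checksum(GOOD)

high = flip_bit(GOOD, 40)  # lands far outside the filesystem
low = flip_bit(GOOD, 3)    # still looks like a valid block number

for name, value in (("high-bit flip", high), ("low-bit flip", low)):
    print(name,
          "| range check passes:", in_range(value, TOTAL_BLOCKS),
          "| checksum matches:", record_checksum(value) == stored_crc)
```

The range check only rejects the high-bit flip; the low-bit flip would pass
silently, whereas the checksum comparison fails for both.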
Otherwise, I strongly encourage you to learn, and to take
responsibility for the health of your own system. And ideally, you
can also use that knowledge to help other users out, which is the only
way the free-as-in-beer ecosystem can flourish; by having everybody
True. Generally I try to follow that, as much as appears possible.
It is sad that direct end-user-to-developer communication for solving
issues is becoming increasingly problematic here.
Anyway, thank you for the friendly words, useful hints and good references!
Regards,
Nick
helping each other. Who knows, maybe you could even get a job doing
it for a living. :-) :-) :-)
Cheers,