Re: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 in nilfs_segctor_do_construct

From: Brian G.
Date: Sat Feb 15 2020 - 21:25:10 EST


This is my first post to the LKML, so please be kind :) I have also
been affected by this bug. It is triggered whenever a write happens
to the filesystem, so mounting read-only is still an option for
recovering data. I took the time to do a full bisect on the kernel
sources and have identified the commit that introduces the breakage.

Regarding versions, I can confirm that 4.19.83 is stable with regard
to NILFS, and that 4.19.84 and later are broken. I can also confirm
that 5.3.10 works fine, and I have heard that 5.3.12 breaks NILFS as
well. The 5.4.18 kernel still has this issue. I did not trace how far
back the issue goes in the 5.4.x series, or look at the 5.3.x series
in more detail.

To simplify my bisection task, I used the 4.19.x series, and
determined that commit d3b3c0a14615c495118acc4bdca23d53eea46ed2 is the
commit that breaks NILFS. Furthermore, when reverting this commit on
otherwise clean 4.19.84 kernel sources, the NILFS issue does not occur
anymore.
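
In case it helps anyone repeating the bisection, the good/bad decision
for each step could be automated for "git bisect run". This is only a
sketch, not the procedure used for the bisection above. It assumes the
dmesg output of the test VM has been saved to a text file after the
write test; the names bisect_exit_code and main are made up for
illustration. The exit codes (0 = good, 1 = bad, 125 = skip) are the
ones "git bisect run" understands.

```python
import sys

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125

# Substrings taken from the oops line in the subject of this thread.
OOPS_TEXT = "NULL pointer dereference"
OOPS_FUNCTION = "nilfs_segctor_do_construct"


def bisect_exit_code(dmesg_text):
    """Map a captured dmesg log to a `git bisect run` exit code."""
    if not dmesg_text.strip():
        # Nothing captured: the kernel may not have booted, so this
        # revision cannot be judged and should be skipped.
        return SKIP
    if OOPS_TEXT in dmesg_text and OOPS_FUNCTION in dmesg_text:
        return BAD
    return GOOD


def main(argv):
    """Hypothetical entry point: argv[1] is the path to a saved dmesg log."""
    if len(argv) != 2:
        return SKIP
    with open(argv[1], errors="replace") as log:
        return bisect_exit_code(log.read())
```

To use it as a script, end the file with sys.exit(main(sys.argv)) and
call it from whatever wrapper builds, boots and tests each revision.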

I'm not familiar enough with NILFS's internals to determine why the
small caching change in that commit breaks NILFS, nor can I offer a
patch to fix it (other than reverting the offending change), but I
can confirm that this commit is the initial cause. I also know there
have been a lot of changes to kernel caching in more recent (5.4 /
5.5 / 5.6) kernels, so there may be more diagnostic work to do.

I have the test VM that I used for bisection available if someone
wants to coordinate with me on putting together a patch, but
ideally someone can make use of my diagnostic work here directly. I
saved dmesg logs from both the good and bad cases and can send them
if anyone is interested. I can also provide detailed system setup
instructions to reproduce the issue. I did my testing against an
existing external hard drive, but I have been able to reproduce the
issue consistently against a freshly created loopback mount as well,
so it is not just caused by disk corruption or an unclean unmount.
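
A rough sketch of the loopback reproduction, following the steps in
the quoted report below (mount NILFS2, touch file, sync) and its
approx. 512MiB regular file. The function names, the paths, the loop
device and the file name "testfile" are placeholders made up for
illustration, and attaching the image with losetup before running mkfs
is an assumption rather than the exact procedure anyone used. It needs
root and nilfs-utils, and on an affected kernel it is expected to
trigger the oops, so it should only be run in a throwaway VM.

```python
import subprocess


def repro_steps(image, loopdev, mountpoint, size_mib=512):
    """Return the commands for a loopback NILFS2 write test, in order."""
    return [
        # Create a sparse regular file to hold the filesystem.
        ["truncate", "-s", f"{size_mib}M", image],
        # Attach the file to a free loop device (assumed, e.g. /dev/loop0).
        ["losetup", loopdev, image],
        # Make a fresh NILFS2 filesystem with default parameters.
        ["mkfs", "-t", "nilfs2", loopdev],
        # Mount it read-write; the mount point must already exist.
        ["mount", "-t", "nilfs2", loopdev, mountpoint],
        # Any write should do; touching a file is what the report used.
        ["touch", f"{mountpoint}/testfile"],
        # Force the write out to the filesystem.
        ["sync"],
    ]


def run_steps(steps):
    """Run each command in turn and stop at the first failure."""
    for argv in steps:
        subprocess.run(argv, check=True)
```

For example, run_steps(repro_steps("/tmp/nilfs.img", "/dev/loop0",
"/mnt/nilfs")) as root, then check dmesg for the NULL pointer
dereference.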

- Brian

On Sat, Feb 15, 2020 at 8:11 PM ARAI Shun-ichi <hermes@xxxxxxxxxxxxxxx> wrote:
>
> And,
>
> In <20200210.224609.499887311281343618.hermes@xxxxxxxxxxxxxxx>;
> ARAI Shun-ichi <hermes@xxxxxxxxxxxxxxx> wrote
> as Subject "Re: BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8 in nilfs_segctor_do_construct":
>
> > Hi,
> >
> > FYI, reporting additional test results.
> >
> > I reproduced this problem with a clean NILFS2 fs in my previous mail.
> > "Clean" means that the filesystem was made before every test.
> > In this mail, I tried to reproduce it with/without VG/LV, LUKS, loopback.
> >
> > * Not reproduced
> > USB stick - primary partition - NILFS2
> > USB stick - primary partition - VG/LV - NILFS2
> > USB stick - primary partition - VG/LV - LUKS - NILFS2
> > USB stick - primary partition - LUKS - VG/LV - NILFS2
> > USB stick - primary partition - LUKS - VG/LV - LUKS - NILFS2
> > /tmp (tmpfs) - regular file - NILFS2 (loopback mount, kernel 4.19.82)
> > USB stick - primary partition(512MiB) - NILFS2
> >
> > * Reproduced (always, immediately)
> > /tmp (tmpfs) - regular file - NILFS2 (loopback mount)
> > USB stick - primary partition - ext4 - regular file - NILFS2 (loopback mount)
>
> this loopback problem is seen in Kernel 5.5.4.
>
> > Test conditions:
> > kernel 4.19.86 (same as previous test)
> > NILFS2/ext4 filesystem, VG/LV, LUKS were made with default parameters
> > size of "primary partition" in USB stick is approx. 14GiB
> > size of "regular file" is approx. 512MiB
> > "reproduce": mount NILFS2, touch file, sync