Re: [linus:master] [mm/mmap] 0503ea8f5b: kernel_BUG_at_mm/filemap.c

From: Oliver Sang
Date: Wed Mar 01 2023 - 02:32:32 EST


Hi Liam,

On Sun, Feb 26, 2023 at 10:17:56PM -0500, Liam R. Howlett wrote:
> * kernel test robot <oliver.sang@xxxxxxxxx> [230226 20:33]:
> >
> > Greeting,
> >
> > FYI, we noticed kernel_BUG_at_mm/filemap.c due to commit (built with gcc-11):
> >
> > commit: 0503ea8f5ba73eb3ab13a81c1eefbaf51405385a ("mm/mmap: remove __vma_adjust()")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > [test failed on linux-next/master 0222aa9800b25ff171d6dcabcabcd5c42c6ffc3f]
>
> If this is linux-next, shouldn't that mailing list be in the Cc list?

when this report was generated, the commit is already in mainline, so we don't
add linux-next mailing list.

when our auto-bisect captured a mainline commit as first bad commit for one
issue, it will also check latest mainline and latest linux-next at that time to
confirm the issue still exists, otherwise, auto-bisect will regard the issue
'has been fixed' then don't report.

above means the auto-bisect checked the "linux-next/master 0222aa9800b25", but
previously we won't list mainline check if there is already a linux-next check.
Since this seems cause confusion, now we changed a bit about this.
as in a new report also to you
[linus:master] [mm/mmap] 04241ffe3f: Kernel_panic-not_syncing:Fatal_exception
in
[1] https://lore.kernel.org/all/202303010946.d35666d1-oliver.sang@xxxxxxxxx/
(I will also mention this report later)
you may notice that change. do you think that is better or do you have other
suggests? Thanks a lot!

>
> >
> > in testcase: trinity
> > version: trinity-static-i386-x86_64-1c734c75-1_2020-01-06
> > with following parameters:
> >
> > runtime: 300s
> > group: group-04
> >
> > test-description: Trinity is a linux system call fuzz tester.
> > test-url: http://codemonkey.org.uk/projects/trinity/
> >
> >
> > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> >
> > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> >
> >
> > please be noted, as below table, parent also has other type issues, and we
> > found they happen almost at same position of kernel_BUG_at_mm/filemap.c for
> > this commit if looking into dmesg (attached two parent dmesgs as well)
>
> I don't understand what you are saying with the above paragraph.
>
> I thought I understood that the bug happens in the previous commits and
> there is a dmesg from the previous two parents attached.. but when I
> look at the attached two dmesg, they are from the same commit and
> neither has anything to do with filemap.

Let me clarify. they are really from same commit, i.e. 287051b185048c4a,
the parent of 0503ea8f5ba73eb3ab13a81c1ee

maybe from below table, it could be clearer.

for both 0503ea8f5ba73eb3ab13a81c1ee and its parent, we tested 20 times.
then on parent, we observed "dmesg.RIP:anon_vma_clone" 9 times, observed
"dmesg.RIP:dup_anon_vma" 11 times, so I attached two parent dmesgs to
demostrate each case.

for 0503ea8f5ba73eb3ab13a81c1ee, we never observed those 2 issues, instead,
"dmesg.kernel_BUG_at_mm/filemap.c" shows in 9 runs out of 20.

287051b185048c4a 0503ea8f5ba73eb3ab13a81c1ee
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
9:20 -45% :20 dmesg.RIP:anon_vma_clone
11:20 -55% :20 dmesg.RIP:dup_anon_vma
:20 45% 9:20 dmesg.kernel_BUG_at_mm/filemap.c


>
> And why would we blame the later commit for the same bug?

this is due to our auto-bisect could only use some simple logic. in this case,
it finds the dmesg.kernel_BUG_at_mm/filemap.c first occurs on 0503ea8f5b and
not on parent, it has no knowledge if this issue is related with
RIP:anon_vma_clone or RIP:dup_anon_vma on parent (BTW, neither humans in our
team has), and this dmesg.kernel_BUG_at_mm/filemap.c still occcurs on latest
linux-next/master and mainline, so it think it captured a real 'fbc'.
we have a manual check step for report, that's the reason you saw above
paragrah you said which is confusing.

>
> Did something go wrong with the bisection?
>
> >
> > we don't have knowledge if this commit fixes some problem in parent then
> > run further until further issues, but since this commit touches
> > mm/filemap.c, we just made out this report FYI
>
> I changed one line in a comment in mm/filemap.c in the commit.
>
> -------------
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c915ded191f0..992554c18f1f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -97,7 +97,7 @@
> * ->i_pages lock (__sync_single_inode)
> *
> * ->i_mmap_rwsem
> - * ->anon_vma.lock (vma_adjust)
> + * ->anon_vma.lock (vma_merge)
> *
>
> -----------
>
> Are you sure about this bisection? I'm not saying it isn't my fault or
> looking to blame others, but I suspect we are indeed looking at the
> wrong commit here.


as above mentioned [1], now we made out a new report upon 04241ffe3f which seems
less confusing.

(BTW, a tricky thing is for report 04241ffe3f, the auto-bisect actually uses
"Kernel_panic-not_syncing:Fatal_exception" as the keyword to do bisection.
this "Kernel_panic-not_syncing:Fatal_exception" could be caused by both
(RIP:anon_vma_clone or RIP:dup_anon_vma) and kernel_BUG_at_mm/filemap.c.
if it uses RIP:anon_vma_clone or RIP:dup_anon_vma to do bisect, it will fail
since starting from 0503ea8f5b, there is only kernel_BUG_at_mm/filemap.c,
so it will think issue disppear in latest maineline or linux-next/master)


>
> >
> > BTW, we also noticed there is a fix commit
> > 07dc4b1862035 (" mm/mremap: fix dup_anon_vma() in vma_merge() case 4")
> > by further testing, BUG_at_mm/filemap.c is still existing there.
> >
> > +---------------------------------------------+------------+------------+
> > | | 287051b185 | 0503ea8f5b |
> > +---------------------------------------------+------------+------------+
> > | BUG:kernel_NULL_pointer_dereference,address | 11 | |
> > | Oops:#[##] | 11 | |
> > | RIP:dup_anon_vma | 11 | |
> > | Kernel_panic-not_syncing:Fatal_exception | 20 | 9 |
> > | canonical_address#:#[##] | 9 | |
> > | RIP:anon_vma_clone | 9 | |
> > | kernel_BUG_at_mm/filemap.c | 0 | 9 |
> > | invalid_opcode:#[##] | 0 | 9 |
> > | RIP:filemap_unaccount_folio | 0 | 9 |
> > +---------------------------------------------+------------+------------+
> >
>
> At what commit did problems start showing up regardless of what the
> problem was? I did not see any other emails from this bot since
> 2023-02-12, but clearly it is reporting problems with earlier commits
> considering the table above.
>
> >
> > If you fix the issue, kindly add following tag
> > | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > | Link: https://lore.kernel.org/oe-lkp/202302252122.38b2139-oliver.sang@xxxxxxxxx
> >
> >
> > [ 28.065728][ T4983] ------------[ cut here ]------------
> > [ 28.066480][ T4983] kernel BUG at mm/filemap.c:153!
> > [ 28.067153][ T4983] invalid opcode: 0000 [#1] SMP PTI
> > [ 28.067868][ T4983] CPU: 0 PID: 4983 Comm: trinity-c3 Not tainted 6.2.0-rc4-00443-g0503ea8f5ba7 #1
> > [ 28.069001][ T4983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> > [ 28.072145][ T4983] RIP: 0010:filemap_unaccount_folio (filemap.c:?)
> > [ 28.072927][ T4983] Code: 89 fb 0f ba e0 10 72 05 8b 46 30 eb 0a 8b 46 58 85 c0 7f 07 8b 46 54 85 c0 78 11 48 c7 c6 a0 aa 24 82 48 89 ef e8 0b d2 02 00 <0f> 0b 48 89 ef e8 01 e7 ff ff be 13 00 00 00 48 89 ef 41 89 c4 41
> ...
>
> > [ 28.087701][ T4983] __filemap_remove_folio (??:?)
> > [ 28.088418][ T4983] ? unmap_mapping_range_tree (memory.c:?)
> > [ 28.089168][ T4983] ? mapping_can_writeback+0x5/0xc
> > [ 28.089940][ T4983] filemap_remove_folio (??:?)
> > [ 28.090627][ T4983] truncate_inode_folio (??:?)
> > [ 28.091342][ T4983] shmem_undo_range (shmem.c:?)
> > [ 28.092036][ T4983] shmem_truncate_range (??:?)
> > [ 28.092753][ T4983] shmem_fallocate (shmem.c:?)
> > [ 28.093444][ T4983] vfs_fallocate (??:?)
> > [ 28.094128][ T4983] madvise_vma_behavior (madvise.c:?)
> > [ 28.094874][ T4983] do_madvise (??:?)
> > [ 28.095491][ T4983] __ia32_sys_madvise (??:?)
> > [ 28.096166][ T4983] do_int80_syscall_32 (??:?)
> > [ 28.096885][ T4983] entry_INT80_compat (??:?)
>
> What happened to your line numbers? Didn't these show up before? They
> did on 2023-02-06 [1]

oh, this is an issue under investigation now.
I guess the report for 04241ffe3f is kind of more valid?
I also mentioned this problem there and will update you after we fix.

>
> ...
> >
> > To reproduce:
> >
> > # build kernel
> > cd linux
> > cp config-6.2.0-rc4-00443-g0503ea8f5ba7 .config
> > make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
> > make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
> > cd <mod-install-dir>
> > find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
> >
> >
> > git clone https://github.com/intel/lkp-tests.git
> > cd lkp-tests
> > bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
> >
> > # if come across any failure that blocks the test,
> > # please remove ~/.lkp and /lkp dir to run from a clean state.
> >
>
> This does not work for me. Since my last use of lkp it seems something
> was changed and now -watchdog is not recognized by my qemu and so my
> attempts to reproduce this are failing. Is there a way to avoid using
> the -watchdog flag? Running the command by hand fails as it seems some
> files are removed on exit?

this is really related with a deprecated option '-watchdog'. (as Matthew Wilcox
also mentioned)

on my local platform:
$ qemu-system-x86_64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
$ qemu-system-x86_64 -watchdog i6300esb
qemu-system-x86_64: -watchdog i6300esb: warning: -watchdog is deprecated; use -device instead.

I guess you are using even newer version which deprecated this option already.
we've already pushed a fix:
https://github.com/intel/lkp-tests/commit/18c7a9c9fe0eebc4536220b97b10b47bfae00595

>
> I did try to remove the directories and run from a clean state, but it
> still fails for me. (see below)
>
>
> Thanks,
> Liam
>
> 1. https://lore.kernel.org/linux-mm/202302062208.24d3e563-oliver.sang@xxxxxxxxx/
>
> Log of failed lkp 68d76160fd7bb767c4a63e7709706b462c475e1b
> ======================
> x86_64
> ==> Making package: lkp-src 0-1 (Sun Feb 26 09:31:45 PM EST 2023)
> ==> Checking runtime dependencies...
> ==> Checking buildtime dependencies...
> ==> WARNING: Using existing $srcdir/ tree
> ==> Removing existing $pkgdir/ directory...
> ==> Starting build()...
> make: Entering directory '/home/jedix/lkp-tests/bin/event'
> klcc -D_FORTIFY_SOURCE=2 -c -o wakeup.o wakeup.c
> klcc -Wl,-O1,--sort-common,--as-needed,-z,relro -static -o wakeup wakeup.o
> rm -f wakeup.o
> strip wakeup
> make: Leaving directory '/home/jedix/lkp-tests/bin/event'
> ==> Entering fakeroot environment...
> x86_64
> ==> Starting package()...
> ==> Creating package "lkp-src"...
> 88466 blocks
> renamed '/home/jedix/.lkp/cache/lkp-x86_64.cgz.tmp' -> '/home/jedix/.lkp/cache/lkp-x86_64.cgz'
> ==> Leaving fakeroot environment.
> ==> Finished making: lkp-src 0-1 (Sun Feb 26 09:31:47 PM EST 2023)
> ~/lkp-tests
> 11 blocks
> result_root: /home/jedix/.lkp//result/trinity/group-04-300s/vm-snb/yocto-i386-minimal-20190520.cgz/x86_64-kexec/gcc-11/0503ea8f5ba73eb3ab13a81c1eefbaf51405385a/0
> downloading initrds ...
> use local modules: /home/jedix/.lkp/cache/modules.cgz
> /usr/bin/wget -q --timeout=1800 --tries=1 --local-encoding=UTF-8 https://download.01.org/0day-ci/lkp-qemu/osimage/yocto/yocto-i386-minimal-20190520.cgz -N -P /home/jedix/.lkp/cache/osimage/yocto
> 17916 blocks
> /usr/bin/wget -q --timeout=1800 --tries=1 --local-encoding=UTF-8 https://download.01.org/0day-ci/lkp-qemu/osimage/pkg/debian-x86_64-20180403.cgz/trinity-static-i386-x86_64-1c734c75-1_2020-01-06.cgz -N -P /home/jedix/.lkp/cache/osimage/pkg/debian-x86_64-20180403.cgz
> 43019 blocks
> exec command: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -fsdev local,id=test_dev,path=/home/jedix/.lkp//result/trinity/group-04-300s/vm-snb/yocto-i386-minimal-20190520.cgz/x86_64-kexec/gcc-11/0503ea8f5ba73eb3ab13a81c1eefbaf51405385a/0,security_model=none -device virtio-9p-pci,fsdev=test_dev,mount_tag=9p/virtfs_mount -kernel bzImage -append root=/dev/ram0 RESULT_ROOT=/result/trinity/group-04-300s/vm-snb/yocto-i386-minimal-20190520.cgz/x86_64-kexec/gcc-11/0503ea8f5ba73eb3ab13a81c1eefbaf51405385a/19 BOOT_IMAGE=/pkg/linux/x86_64-kexec/gcc-11/0503ea8f5ba73eb3ab13a81c1eefbaf51405385a/vmlinuz-6.2.0-rc4-00443-g0503ea8f5ba7 branch=linus/master job=/lkp/jobs/scheduled/vm-meta-102/trinity-group-04-300s-yocto-i386-minimal-20190520.cgz-0503ea8f5ba73eb3ab13a81c1eefbaf51405385a-20230224-7240-hzx70n-16.yaml user=lkp ARCH=x86_64 kconfig=x86_64-kexec commit=0503ea8f5ba73eb3ab13a81c1eefbaf51405385a initcall_debug nmi_watchdog=0 vmalloc=256M initramfs_async=0 page_owner=on max_uptime=1200 LKP_LOCAL_RUN=1 selinux=0 debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 net.ifnames=0 printk.devkmsg=on panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0 drbd.minor_count=8 systemd.log_level=err ignore_loglevel console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw ip=dhcp result_service=9p/virtfs_mount -initrd /home/jedix/.lkp/cache/final_initrd -smp 2 -m 3419M -no-reboot -watchdog i6300esb -rtc base=localtime -device e1000,netdev=net0 -netdev user,id=net0 -display none -monitor null -serial stdio
> qemu-system-x86_64: -watchdog: invalid option
>
>
>