Re: kernel BUG at linux-3.16.7-ckt25/fs/dcache.c:2373!

From: J. Bruce Fields
Date: Thu Jun 02 2016 - 16:02:09 EST


On Thu, Jun 02, 2016 at 09:51:27AM +0200, Philipp Hahn wrote:
> Hello,
>
> probably during heavy IO our file server crashed on a BUG_ON in dache.c,
> probably triggered by NFS:
>
> > ------------[ cut here ]------------
> > kernel BUG at /var/build/temp/tmp.BPql4ErveJ/pbuilder/linux-3.16.7-ckt25/fs/dcache.c:2373!
> > invalid opcode: 0000 [#1] SMP
> > Modules linked in: nfsv4 dns_resolver nfsv3 btrfs raid6_pq xor crc32c_generic ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi
> > s_acl auth_rpcgss nfs fscache lockd sunrpc quota_v2 quota_tree lpc_ich mfd_core psmouse i2c_i801 tpm_tis processor pcspkr serio_raw i2c_core tpm rng_core thermal_sys
> > rc16 dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod sd_mod crc_t10dif crct10dif_common sg sr_mod cdrom ata_generic ehci_pci uhci_hcd ata_piix e1000 ehci_hcd mptspi mptscsih mptbase floppy scsi_transport_spi usbcore libata button usb_common
> > CPU: 0 PID: 1455 Comm: nfsd Not tainted 3.16.0-ucs193-amd64 #1 Debian 3.16.7-ckt25-2~bpo70+1~ucs3.3.193.201604181018
> > Hardware name: FUJITSU SIEMENS PRIMERGY RX200S2/D1790/M73IL, BIOS 6.0 Rev. R04A5F5.1790 01/16/2006
> > task: ffff880121cb03d0 ti: ffff8800d0cac000 task.ti: ffff8800d0cac000
> > RIP: 0010:[<ffffffff811d2613>] [<ffffffff811d2613>] __d_rehash+0x53/0x60
> > RSP: 0018:ffff8800d0cafaa0 EFLAGS: 00010282
> > RAX: 0000000000043e72 RBX: ffff8800939d4578 RCX: 000000000000000d
> > RDX: ffffc90000005000 RSI: ffffc90000224390 RDI: ffff8800b8d53018
> > RBP: ffff8800b8d53018 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0100000000000b15 R11: 0000000000000000 R12: ffff8800d1388198
> > R13: ffff8800b8d53018 R14: ffff8800b8d53b58 R15: ffff8800d0cafbe0
> > FS: 0000000000000000(0000) GS:ffff880127c00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: ffffffffff600400 CR3: 00000001007d7000 CR4: 00000000000007f0
> > Stack:
> > ffffffff811d2a3d ffff8800d0cafb27 ffff8800d1388198 ffff8800b8d53b58
> > ffff8800d1388198 ffff8800b8d53b58 ffff8800b8d53b58 ffff880122829860
> > ffff8800b8d53b58 ffff8800d0cafbe0 ffffffff811c44f9 ffff8800b8d53b58
> > Call Trace:
> > [<ffffffff811d2a3d>] ? d_materialise_unique+0x8d/0x3d0
> > [<ffffffff811c44f9>] ? lookup_real+0x19/0x50
> > [<ffffffff811c4b27>] ? __lookup_hash+0x37/0x50
> > [<ffffffff811c92fd>] ? lookup_one_len+0xbd/0x110
> > [<ffffffff81241159>] ? reconnect_path+0x109/0x2d0
> > [<ffffffffa0367e20>] ? nfsd_setuser_and_check_port+0xa0/0xa0 [nfsd]
> > [<ffffffff8124144f>] ? exportfs_decode_fh+0x12f/0x300
> > [<ffffffffa036cee5>] ? expkey_match+0x45/0x50 [nfsd]
> > [<ffffffffa02df314>] ? cache_check+0x114/0x3b0 [sunrpc]
> > [<ffffffffa036da5d>] ? exp_find+0xfd/0x1b0 [nfsd]
> > [<ffffffffa036898c>] ? fh_verify+0x2fc/0x5a0 [nfsd]
> > [<ffffffffa0378df3>] ? nfsd4_proc_compound+0x4c3/0x790 [nfsd]
> > [<ffffffffa0364df4>] ? nfsd_dispatch+0xe4/0x230 [nfsd]
> > [<ffffffffa02d3dae>] ? svc_process_common+0x33e/0x6a0 [sunrpc]
> > [<ffffffffa0364680>] ? nfsd_destroy+0x70/0x70 [nfsd]
> > [<ffffffffa02d44d0>] ? svc_process+0x110/0x160 [sunrpc]
> > [<ffffffffa0364747>] ? nfsd+0xc7/0x140 [nfsd]
> > [<ffffffff8108e141>] ? kthread+0xc1/0xe0
> > [<ffffffff8108e080>] ? flush_kthread_worker+0x80/0x80
> > [<ffffffff81572598>] ? ret_from_fork+0x58/0x90
> > [<ffffffff8108e080>] ? flush_kthread_worker+0x80/0x80
> > Code: 89 47 08 74 04 48 89 50 08 48 89 77 10 48 83 ca 01 48 89 16 0f ba 36 00 c3 0f 1f 80 00 00 00 00 f3 90 48 8b 06 a8 01 75 f7 eb b9 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 47
> > RIP [<ffffffff811d2613>] __d_rehash+0x53/0x60
> > RSP <ffff8800d0cafaa0>
> > ---[ end trace 18587f42c106645d ]---
>
> There was a report 4 years ago where the same BUG_ON() was hit, but
> during some renames:
> <http://linux-fsdevel.vger.kernel.narkive.com/GcMw2p94/kernel-bug-at-fs-dcache-c-2105-d-rehash-bug-on-d-unhashed-entry>
>
>
> Our kernel is 3.16.7-ckt25 from Debian + ckt26 and ckt27 on top.

So this is the BUG_ON(!d_unhashed(..)) at the top of __d_rehash? I
don't know the -ckt kernels, and I don't have a BUG at line 2373 of
dcache.c, or something else?

There have been a lot of changes in the area since 3.16.7.

--b.

> I don't know exactly what our server was doing at that time, but we were
> setting up a Debian mirror at that time on "btrfs" on an iSCSI storage
> connected to that host.
> The host also imports 2 NFS shares from other servers and exports 3 NFS
> volumes itself.
>
>
> Last week-end the same server got stuck while being backed up:
>
> > BUG: soft lockup - CPU#1 stuck for 22s! [bacula-fd:16612]
> > Modules linked in: nfsv4 dns_resolver nfsv3 btrfs raid6_pq xor crc32c_generic ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rpcsec_gss_krb5 nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc quota_v2 quota_tree lpc_ich mfd_core psmouse tpm_tis tpm i2c_i801 rng_core serio_raw processor i2c_core e752x_edac evdev shpchp edac_core pcspkr thermal_sys ext4 jbd2 crc16 dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod sd_mod crc_t10dif crct10dif_common usb_storage sg sr_mod cdrom ata_generic floppy ehci_pci uhci_hcd ata_piix ehci_hcd mptspi mptscsih libata usbcore e1000 mptbase scsi_transport_spi button usb_common
> > CPU: 1 PID: 16612 Comm: bacula-fd Not tainted 3.16.0-ucs193-amd64 #1 Debian 3.16.7-ckt25-2~bpo70+1~ucs3.3.193.201604181018
> > Hardware name: FUJITSU SIEMENS PRIMERGY RX200S2/D1790/M73IL, BIOS 6.0 Rev. R04A5F5.1790 01/16/2006
> > task: ffff8800c6bf6ce0 ti: ffff8800732cc000 task.ti: ffff8800732cc000
> > RIP: 0010:[<ffffffff811d36c0>] [<ffffffff811d36c0>] __d_lookup_rcu+0x80/0x150
> > RSP: 0018:ffff8800732cfc58 EFLAGS: 00000282
> > RAX: 0000000000000000 RBX: 000102d000000000 RCX: 000000000000000d
> > RDX: ffffc90000005000 RSI: ffff8800732cfe10 RDI: ffff880012c2b918
> > RBP: ffff880012c2b918 R08: ffff880104804050 R09: 0000000000000023
> > R10: ffffffffffffffff R11: 7bfe9e189e400000 R12: ffff8800beb5f080
> > R13: ffff88011b2102e8 R14: ffffffff811a035e R15: 00000000000102d0
> > FS: 00007fa2e701f700(0000) GS:ffff880127c80000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: ffffffffff600400 CR3: 00000000bd005000 CR4: 00000000000007e0
> > Stack:
> > 0000000000000006 ffff880104804050 0000000000000000 ffffffff811c6449
> > ffff8800732cfcc0 ffff8800732cfd48 ffff8800732cfd38 0000000000000000
> > ffff880012c2b918 ffff880121f0e060 ffff8800732cfe00 ffffffff811c569e
> > Call Trace:
> > [<ffffffff811c6449>] ? link_path_walk+0x69/0x850
> > [<ffffffff811c569e>] ? lookup_fast+0x3e/0x2c0
> > [<ffffffff811c6d68>] ? path_lookupat+0x138/0x740
> > [<ffffffff814bda33>] ? tcp_sendmsg+0xd3/0xd20
> > [<ffffffff812138f5>] ? posix_acl_xattr_get+0x45/0xb0
> > [<ffffffff811c73a4>] ? filename_lookup+0x34/0xd0
> > [<ffffffff811c5cf1>] ? getname_flags.part.21+0x91/0x140
> > [<ffffffff811cb289>] ? user_path_at_empty+0x99/0x120
> > [<ffffffff811f7db4>] ? fsnotify+0x234/0x300
> > [<ffffffff811ba4cf>] ? do_sync_write+0x5f/0x90
> > [<ffffffff811bfc00>] ? vfs_fstatat+0x40/0x90
> > [<ffffffff811bfcc2>] ? SYSC_newlstat+0x12/0x30
> > [<ffffffff811bb33c>] ? vfs_write+0x16c/0x200
> > [<ffffffff811bb80c>] ? SyS_write+0x7c/0xb0
> > [<ffffffff8157264d>] ? system_call_fast_compare_end+0x10/0x15
> > Code: 00 00 00 4d 89 e9 49 c7 c2 ff ff ff ff 49 c1 e9 20 eb 14 0f 1f 84 00 00 00 00 00 48 8b 1b 48 85 db 0f 84 84 00 00 00 4c 8d 63 f8 <8b> 43 fc 48 39 6b 10 75 e7 48 83 7b 08 00 74 e0 89 c2 83 e2 fe
>
> The trace-back triggered by arch_trigger_all_cpu_backtrace() shows the
> following two functions in addition on top of link_path_walk():
>
> > [<ffffffff811d36c7>] ? __d_lookup_rcu+0x87/0x150
> > [<ffffffff811c5f3b>] ? __inode_permission+0x7b/0xd0
> > [<ffffffff811c6449>] ? link_path_walk+0x69/0x850
> ...
>
> Has someone seen similar problems with 3.16?
>
> If you need more info, just ask.
>
> Please cc: on replies as I'm not subscribed to linux-fsdevel (but LKML)
>
> Philipp
> --
> Philipp Hahn
> Open Source Software Engineer
>
> Univention GmbH
> be open.
> Mary-Somerville-Str. 1
> D-28359 Bremen
> Tel.: +49 421 22232-0
> Fax : +49 421 22232-99
> hahn@xxxxxxxxxxxxx
>
> http://www.univention.de/
> GeschÃftsfÃhrer: Peter H. Ganten
> HRB 20755 Amtsgericht Bremen
> Steuer-Nr.: 71-597-02876
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html