kernel BUG at linux-3.16.7-ckt25/fs/dcache.c:2373!

From: Philipp Hahn
Date: Thu Jun 02 2016 - 04:00:56 EST


Hello,

probably during heavy IO our file server crashed on a BUG_ON in dache.c,
probably triggered by NFS:

> ------------[ cut here ]------------
> kernel BUG at /var/build/temp/tmp.BPql4ErveJ/pbuilder/linux-3.16.7-ckt25/fs/dcache.c:2373!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: nfsv4 dns_resolver nfsv3 btrfs raid6_pq xor crc32c_generic ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi
> s_acl auth_rpcgss nfs fscache lockd sunrpc quota_v2 quota_tree lpc_ich mfd_core psmouse i2c_i801 tpm_tis processor pcspkr serio_raw i2c_core tpm rng_core thermal_sys
> rc16 dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod sd_mod crc_t10dif crct10dif_common sg sr_mod cdrom ata_generic ehci_pci uhci_hcd ata_piix e1000 ehci_hcd mptspi mptscsih mptbase floppy scsi_transport_spi usbcore libata button usb_common
> CPU: 0 PID: 1455 Comm: nfsd Not tainted 3.16.0-ucs193-amd64 #1 Debian 3.16.7-ckt25-2~bpo70+1~ucs3.3.193.201604181018
> Hardware name: FUJITSU SIEMENS PRIMERGY RX200S2/D1790/M73IL, BIOS 6.0 Rev. R04A5F5.1790 01/16/2006
> task: ffff880121cb03d0 ti: ffff8800d0cac000 task.ti: ffff8800d0cac000
> RIP: 0010:[<ffffffff811d2613>] [<ffffffff811d2613>] __d_rehash+0x53/0x60
> RSP: 0018:ffff8800d0cafaa0 EFLAGS: 00010282
> RAX: 0000000000043e72 RBX: ffff8800939d4578 RCX: 000000000000000d
> RDX: ffffc90000005000 RSI: ffffc90000224390 RDI: ffff8800b8d53018
> RBP: ffff8800b8d53018 R08: 0000000000000000 R09: 0000000000000000
> R10: 0100000000000b15 R11: 0000000000000000 R12: ffff8800d1388198
> R13: ffff8800b8d53018 R14: ffff8800b8d53b58 R15: ffff8800d0cafbe0
> FS: 0000000000000000(0000) GS:ffff880127c00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffffffffff600400 CR3: 00000001007d7000 CR4: 00000000000007f0
> Stack:
> ffffffff811d2a3d ffff8800d0cafb27 ffff8800d1388198 ffff8800b8d53b58
> ffff8800d1388198 ffff8800b8d53b58 ffff8800b8d53b58 ffff880122829860
> ffff8800b8d53b58 ffff8800d0cafbe0 ffffffff811c44f9 ffff8800b8d53b58
> Call Trace:
> [<ffffffff811d2a3d>] ? d_materialise_unique+0x8d/0x3d0
> [<ffffffff811c44f9>] ? lookup_real+0x19/0x50
> [<ffffffff811c4b27>] ? __lookup_hash+0x37/0x50
> [<ffffffff811c92fd>] ? lookup_one_len+0xbd/0x110
> [<ffffffff81241159>] ? reconnect_path+0x109/0x2d0
> [<ffffffffa0367e20>] ? nfsd_setuser_and_check_port+0xa0/0xa0 [nfsd]
> [<ffffffff8124144f>] ? exportfs_decode_fh+0x12f/0x300
> [<ffffffffa036cee5>] ? expkey_match+0x45/0x50 [nfsd]
> [<ffffffffa02df314>] ? cache_check+0x114/0x3b0 [sunrpc]
> [<ffffffffa036da5d>] ? exp_find+0xfd/0x1b0 [nfsd]
> [<ffffffffa036898c>] ? fh_verify+0x2fc/0x5a0 [nfsd]
> [<ffffffffa0378df3>] ? nfsd4_proc_compound+0x4c3/0x790 [nfsd]
> [<ffffffffa0364df4>] ? nfsd_dispatch+0xe4/0x230 [nfsd]
> [<ffffffffa02d3dae>] ? svc_process_common+0x33e/0x6a0 [sunrpc]
> [<ffffffffa0364680>] ? nfsd_destroy+0x70/0x70 [nfsd]
> [<ffffffffa02d44d0>] ? svc_process+0x110/0x160 [sunrpc]
> [<ffffffffa0364747>] ? nfsd+0xc7/0x140 [nfsd]
> [<ffffffff8108e141>] ? kthread+0xc1/0xe0
> [<ffffffff8108e080>] ? flush_kthread_worker+0x80/0x80
> [<ffffffff81572598>] ? ret_from_fork+0x58/0x90
> [<ffffffff8108e080>] ? flush_kthread_worker+0x80/0x80
> Code: 89 47 08 74 04 48 89 50 08 48 89 77 10 48 83 ca 01 48 89 16 0f ba 36 00 c3 0f 1f 80 00 00 00 00 f3 90 48 8b 06 a8 01 75 f7 eb b9 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 47
> RIP [<ffffffff811d2613>] __d_rehash+0x53/0x60
> RSP <ffff8800d0cafaa0>
> ---[ end trace 18587f42c106645d ]---

There was a report 4 years ago where the same BUG_ON() was hit, but
during some renames:
<http://linux-fsdevel.vger.kernel.narkive.com/GcMw2p94/kernel-bug-at-fs-dcache-c-2105-d-rehash-bug-on-d-unhashed-entry>


Our kernel is 3.16.7-ckt25 from Debian + ckt26 and ckt27 on top.
I don't know exactly what our server was doing at that time, but we were
setting up a Debian mirror at that time on "btrfs" on an iSCSI storage
connected to that host.
The host also imports 2 NFS shares from other servers and exports 3 NFS
volumes itself.


Last week-end the same server got stuck while being backed up:

> BUG: soft lockup - CPU#1 stuck for 22s! [bacula-fd:16612]
> Modules linked in: nfsv4 dns_resolver nfsv3 btrfs raid6_pq xor crc32c_generic ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rpcsec_gss_krb5 nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc quota_v2 quota_tree lpc_ich mfd_core psmouse tpm_tis tpm i2c_i801 rng_core serio_raw processor i2c_core e752x_edac evdev shpchp edac_core pcspkr thermal_sys ext4 jbd2 crc16 dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod sd_mod crc_t10dif crct10dif_common usb_storage sg sr_mod cdrom ata_generic floppy ehci_pci uhci_hcd ata_piix ehci_hcd mptspi mptscsih libata usbcore e1000 mptbase scsi_transport_spi button usb_common
> CPU: 1 PID: 16612 Comm: bacula-fd Not tainted 3.16.0-ucs193-amd64 #1 Debian 3.16.7-ckt25-2~bpo70+1~ucs3.3.193.201604181018
> Hardware name: FUJITSU SIEMENS PRIMERGY RX200S2/D1790/M73IL, BIOS 6.0 Rev. R04A5F5.1790 01/16/2006
> task: ffff8800c6bf6ce0 ti: ffff8800732cc000 task.ti: ffff8800732cc000
> RIP: 0010:[<ffffffff811d36c0>] [<ffffffff811d36c0>] __d_lookup_rcu+0x80/0x150
> RSP: 0018:ffff8800732cfc58 EFLAGS: 00000282
> RAX: 0000000000000000 RBX: 000102d000000000 RCX: 000000000000000d
> RDX: ffffc90000005000 RSI: ffff8800732cfe10 RDI: ffff880012c2b918
> RBP: ffff880012c2b918 R08: ffff880104804050 R09: 0000000000000023
> R10: ffffffffffffffff R11: 7bfe9e189e400000 R12: ffff8800beb5f080
> R13: ffff88011b2102e8 R14: ffffffff811a035e R15: 00000000000102d0
> FS: 00007fa2e701f700(0000) GS:ffff880127c80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffffff600400 CR3: 00000000bd005000 CR4: 00000000000007e0
> Stack:
> 0000000000000006 ffff880104804050 0000000000000000 ffffffff811c6449
> ffff8800732cfcc0 ffff8800732cfd48 ffff8800732cfd38 0000000000000000
> ffff880012c2b918 ffff880121f0e060 ffff8800732cfe00 ffffffff811c569e
> Call Trace:
> [<ffffffff811c6449>] ? link_path_walk+0x69/0x850
> [<ffffffff811c569e>] ? lookup_fast+0x3e/0x2c0
> [<ffffffff811c6d68>] ? path_lookupat+0x138/0x740
> [<ffffffff814bda33>] ? tcp_sendmsg+0xd3/0xd20
> [<ffffffff812138f5>] ? posix_acl_xattr_get+0x45/0xb0
> [<ffffffff811c73a4>] ? filename_lookup+0x34/0xd0
> [<ffffffff811c5cf1>] ? getname_flags.part.21+0x91/0x140
> [<ffffffff811cb289>] ? user_path_at_empty+0x99/0x120
> [<ffffffff811f7db4>] ? fsnotify+0x234/0x300
> [<ffffffff811ba4cf>] ? do_sync_write+0x5f/0x90
> [<ffffffff811bfc00>] ? vfs_fstatat+0x40/0x90
> [<ffffffff811bfcc2>] ? SYSC_newlstat+0x12/0x30
> [<ffffffff811bb33c>] ? vfs_write+0x16c/0x200
> [<ffffffff811bb80c>] ? SyS_write+0x7c/0xb0
> [<ffffffff8157264d>] ? system_call_fast_compare_end+0x10/0x15
> Code: 00 00 00 4d 89 e9 49 c7 c2 ff ff ff ff 49 c1 e9 20 eb 14 0f 1f 84 00 00 00 00 00 48 8b 1b 48 85 db 0f 84 84 00 00 00 4c 8d 63 f8 <8b> 43 fc 48 39 6b 10 75 e7 48 83 7b 08 00 74 e0 89 c2 83 e2 fe

The trace-back triggered by arch_trigger_all_cpu_backtrace() shows the
following two functions in addition on top of link_path_walk():

> [<ffffffff811d36c7>] ? __d_lookup_rcu+0x87/0x150
> [<ffffffff811c5f3b>] ? __inode_permission+0x7b/0xd0
> [<ffffffff811c6449>] ? link_path_walk+0x69/0x850
...

Has someone seen similar problems with 3.16?

If you need more info, just ask.

Please cc: on replies as I'm not subscribed to linux-fsdevel (but LKML)

Philipp
--
Philipp Hahn
Open Source Software Engineer

Univention GmbH
be open.
Mary-Somerville-Str. 1
D-28359 Bremen
Tel.: +49 421 22232-0
Fax : +49 421 22232-99
hahn@xxxxxxxxxxxxx

http://www.univention.de/
GeschÃftsfÃhrer: Peter H. Ganten
HRB 20755 Amtsgericht Bremen
Steuer-Nr.: 71-597-02876