BUG_ON() in workingset_node_shadows_dec() triggers
From: Linus Torvalds
Date: Tue Oct 04 2016 - 00:01:12 EST
I'm really sorry I applied that last series from Andrew just before
doing the 4.8 release, because they cause problems, and now it is in
4.8 (and that buggy crap is marked for stable too).
In particular, I just got this
kernel BUG at ./include/linux/swap.h:276
and the end result was a dead kernel.
The bug that commit 22f2ac51b6d64 ("mm: workingset: fix crash in
shadow node shrinker caused by replace_page_cache_page()") purports to
have fixed has apparently been there since 3.15, but the fix is
clearly worse than the bug it tried to fix, since that original bug
has never killed my machine!
I should have reacted to the damn added BUG_ON() lines. I suspect I
will have to finally just remove the idiotic BUG_ON() concept once and
for all, because there is NO F*CKING EXCUSE to knowingly kill the
kernel.
Why the hell was that not a *warning*?
Yes, I'm grumpy. This went in very late in the release candidates, and
I had higher expectations of things coming in through Andrew. Adding
random BUG_ON()'s to code that clearly hasn't had sufficient testing
is *not* acceptable, and it's definitely not acceptable to send that
to me after rc8 unless it has gotten a *lot* of testing, which it
clearly must not have had. Adding stable to the cc too to warn about
this.
The full report is
kernel BUG at ./include/linux/swap.h:276!
invalid opcode: 0000 [#1] SMP
Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE
nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns
nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis
industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss
nfs_acl lockd grace sunrpc dm_crypt
CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
Hardware name: System manufacturer System Product Name/Z170-K, BIOS
1803 05/06/2016
task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
RIP: page_cache_tree_insert+0xf1/0x100
RSP: 0018:ffff8faa7f47bab0 EFLAGS: 00010046
RAX: 0000000000000001 RBX: ffff8faadfaf8c18 RCX: ffff8fa8737b5488
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fa8737b4b48
RBP: ffff8faa7f47bae8 R08: 0000000000000012 R09: ffff8fa8737b54b0
R10: 0000000000000040 R11: ffff8fa8737b54b0 R12: ffffea000b1ad580
R13: 0000000000000000 R14: ffff8faa7f47bb48 R15: ffffea000b1ad580
FS: 00007ffba3a61780(0000) GS:ffff8faaf6c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffba31a5430 CR3: 00000002c6d40000 CR4: 00000000003406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__add_to_page_cache_locked+0x12e/0x270
add_to_page_cache_lru+0x4e/0xe0
mpage_readpages+0x112/0x1d0
blkdev_readpages+0x1d/0x20
__do_page_cache_readahead+0x1ad/0x290
force_page_cache_readahead+0xaa/0x100
page_cache_sync_readahead+0x3f/0x50
generic_file_read_iter+0x5af/0x740
blkdev_read_iter+0x35/0x40
__vfs_read+0xe1/0x130
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x13/0x8f
Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48
83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f>
0b e8 88 68 ef ff 0f 1f 84 00
RIP page_cache_tree_insert+0xf1/0x100
and I hope somebody can see what is going wrong in there. The reason
the machine *dies* from that thing is that we end up then immediately
having a
BUG: unable to handle kernel paging request at ffffffffb70bdaa8
IP: blk_flush_plug_list+0x8b/0x250
Call Trace:
schedule+0x61/0x80
do_exit+0x8c8/0xae0
rewind_stack_do_exit+0x17/0x20
and then a
Fixing recursive fault but reboot is needed!
and the machine will never recover.
People who add random assert statements that kill machines should damn
well not be let near the VM layer.
Johannes? Please make this your first priority. And in the meantime I
will make that VM_BUG_ON() be a VM_WARN_ON_ONCE().
And dammit, if anybody else feels that they had done "debugging
messages with BUG_ON()", I would suggest you
(a) rethink your approach to programming
(b) send me patches to remove the crap entirely, or make them real
*DEBUGGING* messages, not "kill the whole machine" messages.
I've ranted against people using BUG_ON() for debugging in the past.
Why the f*ck does this still happen? And Andrew - please stop taking
those kinds of patches! Lookie here:
https://lwn.net/Articles/13183/
so excuse me for being upset that people still do this shit almost 15
years later.
Linus