Re: slab corruption with current -git (was Re: [git pull] vfs pile 1 (splice))

From: Linus Torvalds
Date: Mon Oct 10 2016 - 12:30:37 EST


On Mon, Oct 10, 2016 at 6:49 AM, Aaron Conole <aconole@xxxxxxxxxx> wrote:
>
> Okay, I'm looking it over. Sorry for the mess.

So as I already answered to Dave, I'm not actually sure that this was
the buggy code, or that my patch would make any difference at all.

I never got a good reproducer for the bug: I spent much of the weekend
rebooting, because it seems to happen only just after a reboot, as I
log in and start my usual thing.

I initially blamed some odd filesystem or block layer issue ("Oh, it
only happens with a cold cache"), partly because the initial
non-poisoned slub oopses happened in filesystem code.

But I now think it's netfilter, and I *think* that what triggers it is
something like the bluetooth subsystem giving up or something. What I
do when I log into a new session tends to be to go to the kernel
subdirectory in one or two terminals, and fire up chrome to read
email. And the problem either happens within half a minute of me
doing that, or it never happens at all.

Which is why I ended up rebooting a *lot*. Just running the kernel
never triggered it.

(It took me some time to figure that out, which is basically why I did
almost no pull requests the whole weekend)

The journal entries for that invalid kernel access are somewhat suggestive:

Oct 09 13:24:03 i7 dbus-daemon[1030]: [system] Failed to activate
service 'org.bluez': timed out

Oct 09 13:24:09 i7 audit[1]: SERVICE_STOP pid=1 uid=0
auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0
msg='unit=systemd-hostnamed comm="systemd"
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=?
res=success'

Oct 09 13:24:09 i7 kernel: general protection fault: 0000 [#1] SMP

so it happened just as *some* network setup thing was finishing off (I
don't think it was systemd-hostnamed itself that necessarily matters,
but clearly something was finishing up as the netfilter problem
occurred).

> I'll review it, and test it. Can you tell me what steps you took to
> reproduce the oops?

See above: I can't actually really "reproduce" it. It's probably
highly timing-dependent, and it is not unlikely that it's also very
much about specific setup. I'm running plain Fedora 24, I boot up, log
in, start two or three terminals, fire up chrome, and ...

So far I've seen the problem maybe 5-6 times, but a couple of those
were just silent hangs (I may have rebooted too quickly for things to
hit the disk, or the oops may just have killed the machine too hard).
Twice I got the oops inside slub code, and I only have one successful
slub poisoning oops from netfilter.

(Part of the reason I only have one is that once I got that, I stopped
rebooting, and instead started looking at the netfilter code and
started to do some merge window pulls again because I felt that this
is *probably* the core reason, and I can't afford to not do pulls
during the merge window for _too_ long).

Linus