Re: BUG: corrupted list in freeary
From: Dmitry Vyukov
Date: Tue Mar 26 2019 - 04:47:22 EST
On Mon, Dec 3, 2018 at 3:53 PM Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
> On Sat, Dec 1, 2018 at 9:22 PM Manfred Spraul <manfred@xxxxxxxxxxxxxxxx> wrote:
> >
> > Hi Dmitry,
> >
> > On 11/30/18 6:58 PM, Dmitry Vyukov wrote:
> > > On Thu, Nov 29, 2018 at 9:13 AM, Manfred Spraul
> > > <manfred@xxxxxxxxxxxxxxxx> wrote:
> > >> Hello together,
> > >>
> > >> On 11/27/18 4:52 PM, syzbot wrote:
> > >>
> > >> Hello,
> > >>
> > >> syzbot found the following crash on:
> > >>
> > >> HEAD commit: e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
> > >> git tree: upstream
> > >> console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a3400000
> > [...]
> > >> Isn't this a kernel stack overrun?
> > >>
> > >> RSP: 0x..83e008. Assuming 8 kB kernel stack, and 8 kB alignment, we have
> > >> used up everything.
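A quick back-of-the-envelope check of the remark above: with a
THREAD_SIZE-aligned stack, the low bits of RSP are the bytes still free
before overflow. A tiny illustrative helper (not actual kernel code):

/*
 * The stack grows down within a THREAD_SIZE-aligned area, so the
 * offset of RSP within that area is the room left before overflow.
 */
static unsigned long stack_bytes_free(unsigned long rsp,
				      unsigned long thread_size)
{
	return rsp & (thread_size - 1);
}

For an RSP ending in ...83e008 this gives 0x8 bytes with 8 kB (0x2000)
stacks, i.e. essentially exhausted, but roughly 0x2008 bytes (about half)
with the 16 kB KASAN stacks mentioned below.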
> > > I don't know the exact answer; that's just the kernel output that we
> > > captured from the console.
> > >
> > > FWIW with KASAN stacks are 16K:
> > > https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64_types.h#L10
> > Ok, thanks. And stack overrun detection is enabled as well -> a real
> > stack overrun is unlikely.
> > > Well, generally everything except for kernel crashes is expected.
> > >
> > > We actually sandbox it with memcg quite aggressively:
> > > https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L2159
> > > But it seems to manage to either break the limits or cause some
> > > massive memory leaks. The nature of that is still unknown.
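The memcg sandboxing behind that link boils down to something like the
sketch below: put each test process into its own memory cgroup with a
hard limit. This is only an illustration, not the actual syzkaller code;
it assumes a cgroup v1 memory controller mounted at
/sys/fs/cgroup/memory, and the limit value is arbitrary.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void write_file(const char *path, const char *data)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd == -1)
		return;
	n = write(fd, data, strlen(data));
	(void)n;
	close(fd);
}

static void sandbox_memcg(pid_t pid)
{
	char path[128], buf[32];

	/* Dedicated memory cgroup for this test process. */
	snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/syz%d", (int)pid);
	mkdir(path, 0777);

	/* Hard memory limit (200 MB here, purely for illustration). */
	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/memory/syz%d/memory.limit_in_bytes", (int)pid);
	write_file(path, "209715200");

	/* Move the test process into the cgroup. */
	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/memory/syz%d/cgroup.procs", (int)pid);
	snprintf(buf, sizeof(buf), "%d", (int)pid);
	write_file(path, buf);
}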
> >
> > Is it possible to start from that side?
> >
> > Are there other syzkaller runs where the OOM killer triggers that much?
>
> Lots of them:
>
> https://groups.google.com/forum/#!searchin/syzkaller-upstream-moderation/lowmem_reserve
> https://groups.google.com/forum/#!searchin/syzkaller-bugs/lowmem_reserve
>
> But nobody got any hook on the reasons.
>
>
> > >> - Which stress tests are enabled? By chance, I found:
> > >>
> > >> [ 433.304586] FAULT_INJECTION: forcing a failure.
> > >> [ 433.304586] name fail_page_alloc, interval 1, probability 0, space 0, times 0
> > >> [ 433.316471] CPU: 1 PID: 19653 Comm: syz-executor4 Not tainted 4.20.0-rc3+ #348
> > >> [ 433.323841] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
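For context, the fail_page_alloc line in that log comes from the kernel's
fault-injection framework. Configuring it looks roughly like the sketch
below (an illustration only, not the actual syzkaller setup; it assumes
CONFIG_FAULT_INJECTION_DEBUG_FS and debugfs mounted at /sys/kernel/debug):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd == -1)
		return;
	n = write(fd, val, strlen(val));
	(void)n;
	close(fd);
}

static void setup_fail_page_alloc(void)
{
	/* Only fail allocations in tasks that opted in via make-it-fail. */
	write_str("/sys/kernel/debug/fail_page_alloc/task-filter", "Y");
	/* Fail every eligible allocation, a limited number of times. */
	write_str("/sys/kernel/debug/fail_page_alloc/probability", "100");
	write_str("/sys/kernel/debug/fail_page_alloc/interval", "1");
	write_str("/sys/kernel/debug/fail_page_alloc/times", "1");
	/* Opt the current task in. */
	write_str("/proc/self/make-it-fail", "1");
}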
> > >>
> > >> I need some more background, then I can review the code.
> > > What exactly do you mean by "Which stress tests"?
> > > Fault injection is enabled. Also random workload from userspace.
> > >
> > >
> > >> Right now, I would put it into my "unknown syzkaller finding" folder.
> >
> > One more idea: Are there further syzkaller runs that end up with
> > 0x010000 in a pointer?
>
> Hard to say. syzbot has triggered millions of crashes. I can't say that I
> remember this as a distinctive pattern that has come up before.
>
> > From what I see, the sysv sem code that is used is trivial; I don't see
> > how it could cause the observed behavior.
>
> I propose that we postpone further investigation of this until we have
> a reproducer, or this happens more than once, or we gather some other
> information.
> Half of the bugs are simple, so even for a crash that happened only once
> it makes sense to spend 10 minutes looking at the code in case the root
> cause is easy to spot. Hundreds of bugs have been fixed this way. But I
> assume you already did this.
> The thing is that there are 100+ known bugs in the kernel that lead to
> memory corruptions:
> https://syzkaller.appspot.com/#upstream-open
> We try to catch them reliably with KASAN, but KASAN does not give a 100%
> guarantee. So if just one instance of a known bug goes unnoticed and
> leads to a memory corruption, it can later cause an unexplainable
> one-off crash like this one. At this point the higher ROI will probably
> come from spending more time on the hundreds of other known bugs that
> have reproducers, happen lots of times, or are simply easier. Once we
> get rid of most of them, hopefully such unexplainable crashes will go
> down too.
The working hypothesis for this bug is as follows:
semget provokes OOMs, and the OOMs then cause kernel stack
overflow/corruption in wb_workfn. So semget is something of a red herring.
Since we now sandbox test processes with the sem sysctl and friends, I
think we can close this report.
#syz invalid
Though the kernel memory corruption on OOMs is still there.
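For reference, the sem sysctl sandboxing mentioned above amounts to
capping SysV semaphore resources before spawning test processes, roughly
as in the sketch below (the limit values are arbitrary and not the exact
ones syzkaller uses; /proc/sys/kernel/sem takes the four values
SEMMSL SEMMNS SEMOPM SEMMNI):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void limit_sysv_sem(void)
{
	/* SEMMSL SEMMNS SEMOPM SEMMNI - illustrative, conservative limits. */
	const char *limits = "1024 1048576 500 1024";
	int fd = open("/proc/sys/kernel/sem", O_WRONLY);
	ssize_t n;

	if (fd == -1)
		return;
	n = write(fd, limits, strlen(limits));
	(void)n;
	close(fd);
}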