Re: BUG: corrupted list in freeary

From: Dmitry Vyukov
Date: Mon Dec 03 2018 - 09:53:22 EST

Next message: Timur Tabi: "Re: [PATCH] fbdev: fsl-diu: remove redundant null check on cmap"
Previous message: Daniel Lezcano: "Re: [PATCH] clocksource: riscv_timer: Provide sched_clock"
In reply to: Manfred Spraul: "Re: BUG: corrupted list in freeary"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sat, Dec 1, 2018 at 9:22 PM Manfred Spraul <manfred@xxxxxxxxxxxxxxxx> wrote:
>
> Hi Dmitry,
>
> On 11/30/18 6:58 PM, Dmitry Vyukov wrote:
> > On Thu, Nov 29, 2018 at 9:13 AM, Manfred Spraul
> > <manfred@xxxxxxxxxxxxxxxx> wrote:
> >> Hello together,
> >>
> >> On 11/27/18 4:52 PM, syzbot wrote:
> >>
> >> Hello,
> >>
> >> syzbot found the following crash on:
> >>
> >> HEAD commit: e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
> >> git tree: upstream
> >> console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a3400000
> [...]
> >> Isn't this a kernel stack overrun?
> >>
> >> RSP: 0x..83e008. Assuming 8 kB kernel stack, and 8 kB alignment, we have
> >> used up everything.
> > I don't have an exact answer, that's just the kernel output that we
> > captured from the console.
> >
> > FWIW with KASAN stacks are 16K:
> > https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64_types.h#L10
> Ok, thanks. And stack overrun detection is enabled as well -> a real
> stack overrun is unlikely.
> > Well, generally everything except for kernel crashes is expected.
> >
> > We actually sandbox it with memcg quite aggressively:
> > https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L2159
> > But it seems to manage to either break the limits, or cause some
> > massive memory leaks. The nature of that is yet unknown.
>
> Is it possible to start from that side?
>
> Are there other syzkaller runs where the OOM killer triggers that much?

Lots of them:

https://groups.google.com/forum/#!searchin/syzkaller-upstream-moderation/lowmem_reserve
https://groups.google.com/forum/#!searchin/syzkaller-bugs/lowmem_reserve

But nobody has gotten a handle on the reasons yet.

> >> - Which stress tests are enabled? By chance, I found:
> >>
> >> [ 433.304586] FAULT_INJECTION: forcing a failure.
> >> [ 433.304586] name fail_page_alloc, interval 1, probability 0, space 0,
> >> times 0
> >> [ 433.316471] CPU: 1 PID: 19653 Comm: syz-executor4 Not tainted 4.20.0-rc3+
> >> #348
> >> [ 433.323841] Hardware name: Google Google Compute Engine/Google Compute
> >> Engine, BIOS Google 01/01/2011
> >>
> >> I need some more background, then I can review the code.
> > What exactly do you mean by "Which stress tests"?
> > Fault injection is enabled. Also random workload from userspace.
> >
> >
> >> Right now, I would put it into my "unknown syzkaller finding" folder.
>
> One more idea: Are there further syzkaller runs that end up with
> 0x010000 in a pointer?

Hard to say. syzbot has triggered millions of crashes. I can't say that I
remember this as a distinctive pattern that has come up before.

> From what I see, the sysv sem code that is used is trivial, I don't see
> that it could cause the observed behavior.

I propose that we postpone further investigation of this until we have
a reproducer, or this happens more than once, or we gather some other
information.
Half of all bugs are simple, so even for a crash that happened only once
it makes sense to spend 10 minutes looking at the code in case the root
cause is easy to spot. Hundreds of bugs were fixed this way. But I
assume you already did this.
The thing is that there are 100+ known bugs in the kernel that lead to
memory corruption:
https://syzkaller.appspot.com/#upstream-open
We try to catch them reliably with KASAN, but KASAN does not give a 100%
guarantee. So if just one instance of a known bug goes unnoticed and
corrupts memory, it can later lead to an unexplainable one-off crash
like this. At this point the higher ROI will probably come from spending
more time on the hundreds of other known bugs that have reproducers,
happened lots of times, or are just simpler. Once we get rid of most of
them, hopefully such unexplainable crashes will go down too.
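
For reference, the stack arithmetic quoted earlier in the thread (RSP ending in 83e008, 8 kB versus 16 kB stacks) can be checked with a small sketch. It assumes the stack is aligned to its own size and grows downward, and it ignores any guard pages or per-thread data at the bottom of the stack. The helper name `headroom` is hypothetical. Only the low digits of RSP appear in the report, which is enough because the mask only looks at the low bits.

```python
def headroom(sp_low_bits: int, stack_size: int) -> int:
    """Bytes between the stack pointer and the lowest address of its stack.

    Assumes the stack is aligned to stack_size (a power of two) and grows
    downward, so the offset of the pointer inside the aligned block is the
    space that is still unused.
    """
    if stack_size <= 0 or stack_size & (stack_size - 1):
        raise ValueError("stack_size must be a power of two")
    return sp_low_bits & (stack_size - 1)


# Low digits of RSP as quoted in the thread; the high digits are elided
# in the report and are not needed for the masking below.
RSP_LOW = 0x83E008

for size in (8 * 1024, 16 * 1024):
    left = headroom(RSP_LOW, size)
    print(f"{size // 1024} kB stack: {left} bytes left, {size - left} bytes used")
```

Under the 8 kB assumption only 8 bytes would be left, which matches the "used up everything" reading. Under the 16 kB figure there would still be 8200 bytes left, so the same RSP value would not by itself indicate an overrun.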
