Re: [RFC] syzbot process

From: Dmitry Vyukov
Date: Thu Dec 28 2017 - 07:28:18 EST


On Thu, Dec 21, 2017 at 6:09 PM, Eric W. Biederman
<ebiederm@xxxxxxxxxxxx> wrote:
> The thing is syzbot sucks. It tells us things are wrong but not how to
> reproduce the problem. Apparently syzbot will test fixes, but that
> doesn't help when more information is needed to track down the problem.
>
> The long of the short of it is that I don't care about about bug reports
> that no one can reproduce and no human cares about. syzbot doesn't care
> in the sense of helping to fix things, that things are broken. syzbot
> just cries like a baby "It's broken! It's broken!"
>
> Further syzbot is written in a language (go) that switches which kernel
> thread things run in at arbitrary times. That is absolutely not
> productive to understanding what is happening when things break. I have
> heard too many complaints from container run-times that they can't make
> what should be a couple of line change but is completely non-trivial
> because someone choose go for their implementation language. Whatever
> benefits go has it is not a programming lanauge I would choose for fine
> and reproducible control of kernel interfaces.
>
> This in addition to syzbot needing the latest and greatest version of go
> which is not packaged in a handy form by my distro.
>
> Which in my experenience makes syzbot a whining crybaby that won't do
> anything to help and fights you when you try and get close.


Hi Eric,

Re reproducers: that's not completely true. syzbot aims at providing
reproducers for reported bugs, and you can see 140 bug reports with
reproducers here:
https://groups.google.com/forum/#!searchin/syzkaller-bugs/%22reproducer$20is$20attached%22%7Csort:date
Unfortunately, localizing kernel bugs is hard and is not possible in
all cases. The root cause of this is actually in the kernel itself,
not in syzbot. Things would be much simpler if we were testing a
single-threaded, deterministic user-space library. Then we would get
precise reproducers in 100% of cases. But the kernel is a concurrent,
parallel, non-deterministic system that constantly accumulates state.
We do try to incrementally improve the percentage of cases where
syzbot manages to create reproducers in general and C reproducers in
particular. But that will never be 100% due to the nature of the
tested system.

Also, you seem to have dealt with a single hard case. From what I see
over hundreds of reported bugs, in ~2/3 of cases it's actually
possible to localize the bug by looking at the crash report alone (I
see that developers frequently don't even run the reproducer when it's
present). For example, LOCKDEP/KASAN reports frequently contain enough
context information to find the root cause, and lots of
WARNING/BUG/GPFs are due to simple, shallow bugs like a missed input
check or an off-by-one. So I think it would be a mistake to not report
bugs without reproducers. Even if there is no reproducer and it's a
hard bug with no obvious cause, it happened, and it would be wrong to
hide this information from the world and pretend that nothing
happened. But I understand that the bar for fixing bugs without
reproducers is generally higher.

I've looked at the case you dealt with ("proc_flush_task oops").
syzbot has provided a syzkaller reproducer for it, and I was in fact
able to reproduce the crash by running the reproducer. What happened
there is that reproducing the crash took ~15-20 mins. syzbot got lucky
once when running the syzkaller program, but then when it tested the
corresponding C program, it did not trigger the crash within the
allotted time. In such cases syzbot chooses not to misinform you by
claiming that the C program triggers the crash. Each report it
provides is actual kernel output obtained on a freshly booted machine
running the exact reproducer it provides on the exact kernel commit
and config.
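
To put rough numbers on why a reproducer that needs ~15-20 mins can
easily miss within a shorter test window, here is a back-of-the-envelope
sketch. It is not syzbot's actual logic: the exponential time-to-crash
model, the 5-minute budget and the name hit_probability are all
assumptions made up for illustration.

```python
import math


def hit_probability(mean_minutes: float, budget_minutes: float) -> float:
    """Chance of seeing at least one crash within the time budget.

    Assumption (hypothetical model): time-to-crash is exponentially
    distributed with the given mean, i.e. the crash is equally likely
    to fire at any moment of the run.
    """
    if mean_minutes <= 0 or budget_minutes < 0:
        raise ValueError("mean must be positive, budget non-negative")
    return 1.0 - math.exp(-budget_minutes / mean_minutes)


# The 5-minute budget is a made-up figure, not syzbot's real timeout.
for mean in (15.0, 20.0):
    p = hit_probability(mean, budget_minutes=5.0)
    print(f"mean {mean:.0f} min, budget 5 min: p(crash) = {p:.2f}")
```

Under this simplified model, a single run with a budget well below the
mean time-to-crash fails to trigger the bug more often than not, which
is consistent with a program crashing once and then not crashing on the
next attempt.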

Re Go (implementation language): this is not true. The part of
syzkaller that actually executes syscalls has been written in C++
since day one. You can see the code here:
https://github.com/google/syzkaller/tree/master/executor
It does explicit, manual thread scheduling; it is compiled as a static
binary to avoid any variance due to dynamic loading; and it uses
neither C++ runtime support nor malloc, to avoid unexpected mmap
calls. This is mostly for the reasons you outlined.

Re version of Go: yes, that's unfortunate. But there is no way we can
change this with limited human resources and without subscribing to a
constant flow of maintenance work. Distros should provide more
up-to-date packages. For example, the version of gcc that my distro
provides (4.8.4, about 5 years old) can't compile for arm at all. It
just can't produce binaries: the compiler and the assembler don't
agree on the units of alignment directives. I don't see how we can
realistically work around the cumulative amount of bugs in software
over the last 5 years. Fortunately, obtaining a fresh Go toolchain
boils down to unpacking an archive from https://golang.org/dl/.

Please don't draw overly broad conclusions from the one or few
negative cases that you hit. syzkaller has found 1000+ real bugs in
kernels. We are doing our best. The problem domain is hard.

Thank you