Re: BUG: soft lockup in kvm_vm_ioctl

From: Dmitry Vyukov
Date: Thu May 09 2019 - 10:24:21 EST


> > > > Can the KVM maintainers take a look at this? This doesn't have anything to do
> > > > with my commit that syzbot bisected it to.
> > > >
> > > > +Dmitry, statistics lession: if a crash occurs only 1 in 10 times, as was the
> > > > case here, then often it will happen 0 in 10 times by chance. syzbot needs to
> > > > run the reproducer more times if it isn't working reliably. Otherwise it ends
> > > > up blaming some random commit.
> > >
> > > Added a note to https://github.com/google/syzkaller/issues/1051
> > > Thanks
> >
> > As we increase number of instances, we increase chances of hitting
> > unrelated bugs. E.g. take a look at the bisection log for:
> > https://syzkaller.appspot.com/bug?extid=f14868630901fc6151d3
> > What is the optimum number of tests is a good question. I suspect that
> > the current 10 instances is close to optimum. If we use significantly
> > more we may break every other bisection on unrelated bugs...
> >
>
> Only because syzbot is being super dumb in how it does the bisection. AFAICS,
> in the example you linked to, buggy kernels reliably crashed 10 out of 10 times
> with the original crash signature, "WARNING in cgroup_exit". Then at some point
> it tested some kernel without the bug and got a different crash just 1 in 10
> times, "WARNING: ODEBUG bug in netdev_freemem".
>
> The facts that the crash frequency was very different, and the crash signature
> was different, should be taken as a very strong signal that it's not the bug
> being bisected for. And this is something easily checked for in code.
>
> BTW, I hope you're treating fixing this as a high priority, given that syzbot is
> now sending bug reports to kernel developers literally selected at random. This
> is a great way to teach people to ignore syzbot reports. (When I suggested
> bisection originally, I had assumed you'd implement some basic sanity checks so
> that only bisection results likely to be reliable would be mailed out.)



While I believe we can get some quality improvement by shuffling
numbers. I don't think we can get significant improvement overall and
definitely not eliminate wrong bisection results entirely. It's easy
to take a single wrong bisection and design a system around this
scenario, but it's very hard to design a system that will handle all
of them in all generality. For example, look at these bisection logs
for cases where reproduction frequency varies from 1 to all, but
that's still the same bug:
https://syzkaller.appspot.com/x/bisect.txt?x=12df1ba3200000
https://syzkaller.appspot.com/x/bisect.txt?x=10daff1b200000
https://syzkaller.appspot.com/x/bisect.txt?x=1592b037200000
https://syzkaller.appspot.com/x/bisect.txt?x=11c610a7200000
https://syzkaller.appspot.com/x/bisect.txt?x=17affd1b200000
You also refer to "a different crash". But that's not a predicate we
can have. And definitely not something that is "easily checked for in
code". Consider, a function rename anywhere in the range will lead to
as if a different crash. If you look at all bisection logs you find
lots of amusing cases where something that a program may consider a
different bugs is actually the same bug, or the other way around. So
if we increase number of tests and we don't have a way to distinguish
crashes (which we don't), we will necessary increase incorrect results
due to unrelated bugs.

Bisection is a subtle process and the predicate, whatever logic it
does internally, in the end need to produce a single yes/no. And a
single wrong answer in the chain leads to a completely incorrect
result. There are some fundamental reasons for wrong results:
- hard to reproduce bugs (not fixable)
- unrelated bugs/broken builds (fixable)
While tuning numbers can pepper over these to some degree (maybe),
these reasons will stay and will lead to incorrect results. Also I
don't this tuning as something that is trivially to do as you suggest.
For example, how exactly do you assess a crash as reliably happening
vs episodically? How exactly do you choose number of tests for each
case? Choosing too few tests will lead to incorrect results, choosing
too many will lead to incorrect results. How exactly do you assess
that something that was happening reliably now does not happen
reliably? How do you assess that a crash is very different? Each of
the choices have chances of producing more bad results, so one would
need to rerun hundreds of bisections with old/new version, and then
manually mark results and then estimate quality change (which most
likely will be flaky or inconclusive in lots of cases). Tuning quality
of heuristics-based algorithms is very time consuming, especially if
each experiment takes weeks.

There is another down-side for not "super dumb" algorithms. Which is
explaining results. Consider that syzbot now mails a bisection where
the crash happened and a developer sees that it's the same crash, but
syzbot says "nope. did not crash". That will cause reasonable
questions and somebody (who would that be?) will need to come and
explain what happens and why, and how that counter-intuitive local
result was shown to improve quality overall. Simpler algorithms are
much easier to explain.

I consider bisection as high priority, but unfortunately only among
other high priority and very high priority work.
Besides work on the fuzzer itself and bug detection tools, we now test
15 kernels across 6 different OSes. Operational work can't be
deprioritized because then nothing will work at all. Change reviews
can't be deprioritized. Overseeing bug flow can't be deprioritized.
Updating crash parsing in response to new kernel output can't be
deprioritized. Answering all human emails can't be deprioritized.