Re: BUG: workqueue lockup (2)

From: Dmitry Vyukov
Date: Sun May 13 2018 - 10:35:58 EST


On Sun, May 13, 2018 at 4:29 PM, Tetsuo Handa
<penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
> Eric Biggers wrote:
>> Generally it's best to close syzbot bug reports once the original cause is
>> fixed, so that syzbot can continue to report other bugs with the same signature.
>
> That's difficult to judge. Closing as soon as the original cause is fixed allows
> syzbot to try to report different reproducer for different bugs. But at the same time,
> different/similar bugs which were reported in that report (or comments in the discussion
> for that report) will become almost invisible from users (because users unlikely check
> other reports in already fixed bugs).
>
> An example is
>
> general protection fault in kernfs_kill_sb (2)
> https://syzkaller.appspot.com/bug?id=903af3e08fc7ec60e57d9c9b93b035f4fb038d9a
>
> where the cause of above report was already pointed out in the discussion for
> the below report.
>
> general protection fault in kernfs_kill_sb
> https://syzkaller.appspot.com/bug?id=d7db6ecf34f099248e4ff404cd381a19a4075653
>
> Since the latter is marked as "fixed on May 08 18:30", I worry that quite few
> users would check the relationship.
>
>> Note also that a "workqueue lockup" can be caused by almost anything in the
>> kernel, I think. This one for example is probably in the sound subsystem:
>> https://syzkaller.appspot.com/text?tag=CrashReport&x=1767232b800000
>>
>
> Right. Maybe we should not stop the test upon "workqueue lockup" message, for
> it is likely that the cause of lockup is that somebody is busy looping which
> should have been reported shortly as "rcu detected stall".
>
> Of course, there is possibility that "workqueue lockup" is reported because
> cond_resched() was used when explicit schedule_timeout_*() is required, which
> was the reason commit 82607adcf9cdf40f ("workqueue: implement lockup detector")
> was added.
>
> If we stop the test upon "workqueue lockup" message, maybe longer timeout (e.g.
> 300 seconds) is better so that rcu stall or hung task messages are reported
> if rcu stall or hung task is occurring.

Yes, we need order different stalls/lockups/hangs/etc according to
what can trigger what. E.g. rcu stall can trigger task hung and
workqueue lockup, but not the other way around.
There is https://github.com/google/syzkaller/issues/516 to track this.
But I did not yet have time to figure out all required changes.
If you have additional details, please add them there.