Re: WARNING in get_pi_state
From: Dmitry Vyukov
Date: Tue Oct 31 2017 - 06:23:40 EST
On Tue, Oct 31, 2017 at 1:21 PM, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
> On Tue, Oct 31, 2017 at 1:08 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> On Tue, Oct 31, 2017 at 12:29:50PM +0300, Dmitry Vyukov wrote:
>>> I understand your sentiment, but it's definitely not _at all_. The
>>> system compiled this exact code, ran it and triggered the bug on it.
>>> Do you have suggestions on how to make this code more portable? How
>>> would this setup look on your system?
>>
>> So I don't see the point of that tun stuff; what was it supposed to do?
>>
>> All it ever did after creation was flush_tun(), which reads until empty.
>> But given nobody would ever write into it, that's an 'expensive' NO-OP.
>
> See the text below.
> It does try to minimize both programs and features used (e.g. also
> these clunky NONFAILING macros, and filesystem business). But if it
> takes 100 seconds to reproduce, then it's hard to do minimization.
> Consider trying to bisect such bugs: that will also be hard and
> unreliable, and you can end up with a wrong commit.
>
> See this for an example of a much tidier reproducer:
> https://groups.google.com/forum/#!topic/syzkaller-bugs/9nYn7hpNpEk
> But that's a single-threaded bug that triggers instantly each time you
> run the program.
That said, the tun code is not supposed to break the reproducer
either. E.g. on our systems it just sets up tun successfully and then
proceeds to the actual code that triggers the problem. What's the
failure mode with the tun code on your system? If we make it more
portable, then such repros will work on your system as well.
>>> We do try hard to get rid of unnecessary stuff in reproducers. I think
>>> what happened in this case is the following. This is a hard to
>>> reproduce race. The bot was able to reproduce the crash on the initial
>>> program that uses tun, then tried to get rid of the tun code and
>>> re-reproduce it, but it did not reproduce this time, so it concluded
>>> that the tun code is somehow necessary here. That's an unfortunate
>>> consequence of testing complex concurrent code. It may become somewhat
>>> better once we have KTSAN, the race detector.
>>
>> I ripped out the tun bits and it reproduced in ~100 seconds. I've now
>> got it running for well over 30m on the fixed kernel while I'm trying to
>> come up with a comprehensible Changelog ;-)