Re: INFO: rcu detected stall in dummy_timer

From: Andrey Konovalov
Date: Wed Sep 18 2019 - 07:23:03 EST


On Mon, Sep 16, 2019 at 6:32 PM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, 16 Sep 2019, Andrey Konovalov wrote:
>
> > On Fri, Sep 13, 2019 at 10:35 PM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, 13 Sep 2019, syzbot wrote:
> > >
> > > > syzbot has found a reproducer for the following crash on:
> > > >
> > > > HEAD commit: f0df5c1b usb-fuzzer: main usb gadget fuzzer driver
> > > > git tree: https://github.com/google/kasan.git usb-fuzzer
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=1146550d600000
> > > > kernel config: https://syzkaller.appspot.com/x/.config?x=5c6633fa4ed00be5
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=b24d736f18a1541ad550
> > > > compiler: gcc (GCC) 9.0.0 20181231 (experimental)
> > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11203fa5600000
> > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=162cd335600000
> > > >
> > > > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > > > Reported-by: syzbot+b24d736f18a1541ad550@xxxxxxxxxxxxxxxxxxxxxxxxx
> > > >
> > > > yurex 3-1:0.101: yurex_interrupt - unknown status received: -71
> > > > yurex 5-1:0.101: yurex_interrupt - unknown status received: -71
> > > > yurex 6-1:0.101: yurex_interrupt - unknown status received: -71
> > > > rcu: INFO: rcu_sched self-detected stall on CPU
> > >
> > > Andrey:
> > >
> > > This problem may be a result of overloading dummy_timer. The kernel
> > > config you are using has CONFIG_HZ=100, but dummy-hcd needs
> > > CONFIG_HZ=1000 (see the comment on line 1789). That is, lower values
> > > of HZ will occasionally lead to trouble, and this may be an example.
> > >
> > > Can you change the config value for HZ and see if the bug still
> > > reproduces?
> >
> > Hi Alan,
> >
> > I've tried running the reproducer with CONFIG_HZ=1000 and still got
> > the same stall message. It's accompanied by countless "yurex
> > 6-1:0.101: yurex_interrupt - unknown status received: -71" messages,
> > so I believe this is an issue in the yurex driver.
>
> Maybe. Depends on exactly what the reproducer is doing, something
> which is not at all easy to figure out from the scripts or programs.
>
> I got the impression that the reproducer connects an emulated yurex
> device and then disconnects it -- but maybe that's not right at all.
> Maybe the key point is that the reproducer sends a descriptor listing
> an endpoint address that doesn't actually exist; that would have a
> similar effect. Can you tell? (Trying to understand exactly what a
> syzkaller test program does is not for the faint of heart.)
>
> As far as I can remember, the USB spec doesn't say what a device should
> do when the host sends a packet to a non-existent endpoint. Which
> means that some devices will do nothing at all, leading to the -71
> (-EPROTO) errors you see in the log. Indeed, there's only one place in
> dummy_hcd.c where -EPROTO occurs -- for the case where an URB is sent
> to an endpoint not supported by the gadget.
>
> This leads to the question: How should the yurex driver (or any USB
> class driver, in fact) respond to a -EPROTO or similar error? The
> thing is, this sort of error typically arises in two circumstances:
>
> The device was just unplugged, so of course it can't send
> any packets back to the host;
>
> Noise on the bus caused a packet to be lost or corrupted.
>
> In the first case, it doesn't much matter what the driver does because
> the disconnection will be noticed and acted on within a few hundred
> milliseconds (although I suppose a driver could generate a lot of
> kernel-log spam during that time).
>
> In the second case, retrying the lost/corrupted packet is the right
> response.
>
> But retrying is _not_ the right response in cases where the device is
> never going to respond because the endpoint address is invalid. This
> can happen only in situations where the device provides incorrect
> information (bad descriptors or something of that sort). The only
> suitable approach I can think of is to limit the number of retries.
>
> Retry-limiting is not the sort of thing we want to add to each
> individual USB class driver. Maybe it can be handled in the USB core;
> I'll try to write some code for it. Under normal circumstances the
> issue just doesn't arise, because normal devices aren't malicious.
>
> > Why does dumy_hcd require CONFIG_HZ=1000? The comment doesn't really
> > explain the reason.
>
> Oh, that's simple enough. USB events tend to happen at millisecond
> intervals. The data on the USB bus is organized into frames (and
> microframes for high speed and SuperSpeed); a frame lasts one
> millisecond (and a microframe lasts 1/8 ms). Many host controllers
> report important events when a frame boundary occurs (that's how
> dummy-hcd works).
>
> So for proper timing of the emulation, dummy-hcd requires timer
> interrupts with millisecond resolution. I suppose the driver could be
> changed to use a high-res timer instead of a normal kernel timer, but
> for now that doesn't seem particularly important.

So what are the practical differences between using CONFIG_HZ=100 and
1000 for dummy-hcd? Is is going to be slower or faster? Or can it get
overloaded with data and cause stalls? Or something else? We're
somewhat hesitant to change CONFIG_HZ as we don't know how it will
affect other parts of the kernel (at some point the USB fuzzer will
become a part of the main syzbot instance that doesn't only fuzz USB).