Re: PROBLEM: OHCI watchdog timeouts inside VirtualBox, probably due to timer wheel rework

From: Alan Stern
Date: Fri Oct 21 2016 - 16:57:22 EST

On Fri, 21 Oct 2016, Michal Necasek wrote:

> Alan,
> I'll get back to you on whether increasing the timeout helps, it'll
> take a bit of testing. Does it actually sound plausible that some
> controllers could not get things done in 250ms but could in 275ms?

Not very, I admit.

> I will note that according to the table in <1>, a 250ms timeout with
> a HZ=250 kernel (what Ubuntu uses) falls into the 4ms granularity
> bucket, but a 275ms timeout goes into the 32ms granularity bucket.
> That could change things.

Thomas has noted that there is definitely a bug in the new timer code.

> And as a bit of background... the 250ms timeout should not be a
> problem in a virtualized environment under normal conditions. A ~10ms
> timeout is. What makes things harder is that the watchdog routine
> reads the frame counter from the HCCA, which has to be updated
> asynchronously. Reading from a HC register would actually be more
> expensive but much more reliable in this case. But again, 250ms
> should be plenty...
> Another factor is that the OHCI watchdog just kills the driver the
> first time there's a problem, there's no recovery attempt. So it's
> very noticeable when this happens.

I should think so! The whole point of the watchdog is to catch bad
hardware when it starts misbehaving. If that happens, we need to
cancel everything and kill the driver so that it doesn't take large
parts of the system down with it -- which is what used to happen.

Alan Stern

> Regards,
> Michal
> <1>

> ----- Original Message -----
> From: stern@xxxxxxxxxxxxxxxxxxx
> To: michael.thayer@xxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx, tglx@xxxxxxxxxxxxx, michal.necasek@xxxxxxxxxx, knut.osmundsen@xxxxxxxxxx
> Sent: Friday, October 21, 2016 6:54:13 PM GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
> Subject: Re: PROBLEM: OHCI watchdog timeouts inside VirtualBox, probably due to timer wheel rework
> On Fri, 21 Oct 2016, Michael Thayer wrote:
> > Hello Alan (LKML on CC),
> >
> > Contacting you about this on Thomas Gleixner's (also on CC) suggestion.
> > The short summary is that when Linux 4.8.0 (also tested with a few later
> > kernels) is run on a VirtualBox virtual machine with USB enabled, OHCI
> > fails with the log messages "frame counter not updated; disabled" and
> > "HC died; cleaning up". This seems to be due to the 250 ms interval
> > watchdog running with far too short intervals, which we think is a
> > consequence of the timer wheel code rework. I will refer you to a bug
> > filed in Launchpad<1> for a longer description.
> >
> > Hope this is of interest to you.
> >
> > Regards,
> >
> > Michael
> >
> > <1>
> That bug description says the watchdog timer routine can be called
> twice in a 4-ms period, even though it requests a 250-ms delay. Is
> this really true? If it is, it sounds like a real bug in the timer
> core.
> Bryan Paluch reported a similar problem and said that increasing the
> timeout value to 275 ms fixed it:
> Does that patch also fix the "frame counter not updating" problem?
> Alan Stern