Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions

From: David Brownell
Date: Tue Aug 21 2007 - 00:27:37 EST


[ GRR sorry for the premature "SEND" ... mouspads-r-evil ]

On Monday 20 August 2007, Linus Torvalds wrote:
>
> On Mon, 20 Aug 2007, David Brownell wrote:
>
> > On Monday 13 August 2007, Michal Piotrowski wrote:
> > > Subject         : EHCI Regression in 2.6.23-rc2
> > > References      : http://lkml.org/lkml/2007/8/10/81
> > > Last known good : ?
> > > Submitter       : Daniel Exner <dex@xxxxxxxxxxxxxx>
> > > Caused-By       : Stuart_Hayes@xxxxxxxx <Stuart_Hayes@xxxxxxxx>
> > >                   commit 196705c9bbc03540429b0f7cf9ee35c2f928a534
> > > Handled-By      : ?
> > > Status          : unknown
> >
> > Fixed I believe by Stuart's patch:
> >
> > http://marc.info/?l=linux-usb-devel&m=118765934722610&w=2
>
> Quite frankly, I'd personally prefer to just revert commit
> 196705c9bbc03540429b0f7cf9ee35c2f928a534 entirely instead.
>
> The whole dependency on cpufreq seems totally bogus. Would it not be a lot
> more natural to handle the *result* of the problem (ie the MMF errors by
> broken EHCI controllers?) rather than add totally insane workarounds for
> this case to try to hide them in the first place?

MMF basically means the "Transaction Translating" (TT) Hub had data
for the host, but the host didn't collect it in time ... so that some
data was lost. In this context, it's only for periodic transfers,
meaning interrupt (for HID keyboards, mice, etc) or isochronous (mostly
audio or video, but sometimes ATM).

Unfortunately, that's the type of fault that's especially hard to
recover from. Plus, very few of the upper layer drivers have even
a minor clue about fault recovery strategies during I/O ... it's not
supposed to happen for interrupt transfers, one hopes the drivers
will at least die gracefully. With ISO, faults are expected (since
it's "best effort" delivery, time-priority).

On the plus side, MMF errors have been vanishingly rare until this
cpufreq interaction came up ... which of course implies the downside
that those "handle the result" code paths are all but untested.


> There can be *other* delays in reading memory that have nothing to do with
> cpu frequency shifting, and everything to do with exteme situations on the
> bus. If the stupid EHCI controller has some tight latency issues, that's a
> generic problem.

There could be such problems, yes. But in practice, I don't know
that we've ever seen them.

(There's a first time for everthing, yes. I *just* fetched a webpage
where an image got overwritten about half way through fetching it.
Top half was today's, bottom half was tomorrow's, update 12 midnight EST.
Strangest looking JPG ever! ;)


> That commit 196705c9bbc03540429b0f7cf9ee35c2f928a534 just exemplifies what
> is wrong with USB, but it does so by adding incredibly ugly code. I'd
> rather not add even *more* ugly code - especially not for a case where we
> then seem to blame the wrong party (ie a VIA controller that didn't need
> the ugly code in the first place).
>
> Serverworks/Broadcom makes totally crap chips (not just in USB) and then
> doesn't even document their buggy crap hardware.

That's pretty much how I feel about VIA's USB stuff:
buggy crap that I actively steer people away from.

And that's why it doesn't seem odd to me to add even
more workarounds for VIA-only bugs.


> But that is NOT a reason
> for then making the kernel have buggy crap software in it.

I don't think we always have the option not to cope with
broken hardware. We may have some options about *how* we
cope with it though ...


> So really - is there any reason why we just don't say "Broadcom chips
> suck, and get MMF errors under normal circumstances because they are
> crap". And from *that*, the obvious solution would seem to not be to
> penalize everybody else, but to just say that "We will try to recover from
> MMF errors gracefully by retrying the transaction". Hmm?

Well, see above about why retrying wouldn't work well. Data lost, and
not recoverable ... although if the events are only USB keyboards/mice,
then the user might be able to recover. (Stuart?)

Alternatively, if Broadcom then just veto cpufreq changes whenever
there are USB interrupt transfers active. (We *can* veto changes
in notifiers, yes?)

- Dave



>
> Linus
>


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/