Re: hpsa driver bug crack kernel down!

From: Woodhouse, David
Date: Thu Apr 10 2014 - 11:38:04 EST


On Thu, 2014-04-10 at 09:14 -0600, Bjorn Helgaas wrote:
> > Thus, my first guess would be that we are quite happily setting up the
> > requested DMA maps on the *wrong* IOMMU, and then taking faults when the
> > device actually tries to do DMA.
> >
> I like the "wrong IOMMU (or no IOMMU at all)" theory. If we didn't
> connect the device with an IOMMU at all, that would explain the device
> DMAing directly to a physical address, wouldn't it?

An unlikely failure mode. We're much more likely to see *wrong* IOMMU
than no IOMMU. And thus we'd still see the distinctive virtual addresses
just below 4GiB.

However, Rob's answer may solve that puzzle. If this is one of those
abominations where the device continues to do DMA to system memory even
after the OS is up and running and *thinks* it has control of the
hardware, then the offending address will be listed in an RMRR entry
(which tells the OS to set up a 1:1 mapping for access to certain memory
ranges for a given device). And will be inside an E820 reserved region.

A little odd that such an error would trigger only when we're actually
trying to initialise the device from the Linux driver, not as soon as we
enable the IOMMU. But all things are possible.

But the DMAR table and dmesg that I asked for would give us a bit more
information and hopefully let us stop speculating...

> > We should also rate-limit DMA faults, which would avoid the lockup
> > failure mode. Bjorn, what should an IOMMU driver *do* when it detects
> > that a device is creating an endless stream of DMA faults and isn't
> > aborting the transaction?
>
> You mentioned that POWER with EEH does something intelligent in this
> case, but I'm not familiar with that code. We have AER support, which
> can result in resetting a device, but I think DMA faults are reported
> differently, and I don't think there's any nice existing way for PCI
> to deal with them. Maybe there should be, though.

Quite frankly, I don't care how *you* deal with them, or even if you
can. All I want to know is how I tell you about the problem, because *I*
sure as hell don't want to be trying to deal with it in the IOMMU code.
That's a generic PCI layer thing. :)

--
David Woodhouse Open Source Technology Centre
David.Woodhouse@xxxxxxxxx Intel Corporation

Attachment: smime.p7s
Description: S/MIME cryptographic signature