Re: [PATCH] iommu/vt-d: Ratelimit fault handler
From: Alex Williamson
Date: Thu Mar 17 2016 - 12:53:27 EST
On Tue, 15 Mar 2016 19:47:56 +0000
David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
> On Tue, 2016-03-15 at 10:35 -0600, Alex Williamson wrote:
> > Fault rates can easily overwhelm the console and make the system
> > unresponsive. Ratelimit to allow an opportunity for maintenance.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
>
> Rather than just rate-limiting the printk, I'd prefer to handle this
> explicitly. There's a bit in the context-entry which can tell the IOMMU
> not to bother raising an interrupt at all. And then we can re-enable it
> if/when the driver recovers the device. (Or perhaps just when it next
> does a mapping).
Seems like we need to keep statistics per context entry for that, are
you prepared for that sort of overhead? IME, a device that's spewing
faults at this rate is broken to the point where it needs to be removed
from the system or is actively being tested and debugged for driver or
assignment work. In those cases, I think we want to keep reminding the
user that something is very wrong and it probably explains why the
device isn't working properly. If the device is using the DMA API,
maybe clearing FPD on each mapping event is a way to do that, but an
IOMMU API managed device might have very long-lived mapping entries.
It seems impractical to set up statistics per context entry and timers
to check back on them for things that really ought to be rare events.
My goal was only to reduce the overall impact on the system so that
it's usable when this occurs.
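
To illustrate what I mean by ratelimiting, here is a rough userspace
model of an interval/burst limiter. This is not the patch itself and not
the kernel's ratelimit code; the struct and function names
(fault_ratelimit, fault_ratelimit_init, fault_ratelimit_ok) are made up
for this sketch, and the caller supplies the time in milliseconds. The
point is only that the limiter needs one small global state rather than
anything per context entry.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: allow at most `burst` messages per `interval_ms`. */
struct fault_ratelimit {
    uint64_t interval_ms;   /* length of one window */
    unsigned burst;         /* messages allowed per window */
    uint64_t window_start;  /* time the current window began */
    unsigned printed;       /* messages allowed in the current window */
    unsigned missed;        /* total messages suppressed so far */
    bool started;           /* false until the first call */
};

static void fault_ratelimit_init(struct fault_ratelimit *rl,
                                 uint64_t interval_ms, unsigned burst)
{
    rl->interval_ms = interval_ms;
    rl->burst = burst;
    rl->window_start = 0;
    rl->printed = 0;
    rl->missed = 0;
    rl->started = false;
}

/* Returns true if the caller may log a fault at time now_ms. */
static bool fault_ratelimit_ok(struct fault_ratelimit *rl, uint64_t now_ms)
{
    /* Open a new window on first use or once the interval has elapsed. */
    if (!rl->started || now_ms - rl->window_start >= rl->interval_ms) {
        rl->window_start = now_ms;
        rl->printed = 0;
        rl->started = true;
    }

    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }

    /* Over budget for this window: count it and stay quiet. */
    rl->missed++;
    return false;
}
```

With an interval of 5000 ms and a burst of 10 (values picked for the
example), a device raising a thousand faults in the same instant gets ten
messages logged and the rest only counted, so the console stays usable
while the user still sees that something is wrong.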
> We really ought to be reporting faults to drivers too, FWIW. I keep
> meaning to take a look at that.
Yes, that path has been absent for far too long. Thanks,
Alex