Re: [PATCH v4 14/22] iommu: handle page response timeout

From: Jean-Philippe Brucker
Date: Mon Apr 30 2018 - 06:58:26 EST

On 25/04/18 16:37, Jacob Pan wrote:
>> In the other cases (unsupported PRI or rogue guest) then disabling PRI
>> using a FAILURE status might be the right thing to do. However,
>> assuming the device follows the PCI spec it will stop sending page
>> requests once there are as many PPRs in flight as the allocated
>> credit.
> Agreed, here I am not taking any actions. There may be need to drain
> in-fly requests.

Right, as long as we first ensure that no new fault is generated (by
using a Response Failure). Though in my opinion not taking action might
be the safest option :)

Another thought: currently the comment in iommu.h says
"@IOMMU_FAULT_STATUS_FAILURE: General error. Drop all subsequent faults
from this device if possible. This is "Response Failure" in PCI PRI."

I wonder if we should simply say "Drop all subsequent faults from the
device". Even if the PCI device doesn't properly implement PRI, the
IOMMU driver should set a "PRI disabled" bit in the device data that
prevents it from from reporting new faults and flooding the queue.
Anyway, it's a small detail that could go in a future patch series.

>> If there isn't any possibility of memory leak or abusing resources, I
>> don't think it's our problem that the guest is excessively slow at
>> handling page requests. Setting an upper bound to page request latency
>> might do more harm than good. Ensuring that devices respect the number
>> of allocated in-flight PPRs is more important in my opinion.
> How about we have a really long timeout, e.g. 1 min similar to device
> invalidate response timeout in ATS spec., just for basic safety and
> diagnosis. Optionally, we could have quota in parallel.

I agree that for development a timeout is useful. It might be worth
adding it as an option to the IOMMU module instead of a define. Perhaps
a number of seconds, 10 being the default and 0 disabling the timeout?
Otherwise we would probably end up with a succession of patches
incrementing the timeout by arbitrary values, if people find it