Re: [bug report] iommu/arm-smmu-v3: Event cannot be printed in some scenarios

From: Jason Gunthorpe
Date: Thu Jul 25 2024 - 08:58:58 EST


On Thu, Jul 25, 2024 at 07:35:00AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@xxxxxxxx>
> > Sent: Wednesday, July 24, 2024 9:03 PM
> >
> > On Wed, Jul 24, 2024 at 11:24:17AM +0100, Will Deacon wrote:
> > > > This event handling process is as follows:
> > > > arm_smmu_evtq_thread
> > > >     ret = arm_smmu_handle_evt
> > > >         iommu_report_device_fault
> > > >             iopf_param = iopf_get_dev_fault_param(dev);
> > > >             // iopf is not enabled.
> > > >             // No RESUME will be sent!
> > > >             if (WARN_ON(!iopf_param))
> > > >                 return;
> > > >     if (!ret || !__ratelimit(&rs))
> > > >         continue;
> > > >
> > > > In this scenario, the io page-fault capability is not enabled.
> > > > There are two problems here:
> > > > 1. The event information is not printed.
> > > > 2. The entire device (PF level) is stalled, not just the current
> > > > VF. This affects other normal VFs.
>
> Out of curiosity. From your code example the difference before
> and after this change is on the prints. Why would it lead to the
> stall problem?

Because of this:

 	iopf_param = iopf_get_dev_fault_param(dev);
 	if (WARN_ON(!iopf_param))
-		return;

If you hit the WARN_ON then we don't do anything with the fault and it
remains uncompleted.

> > + * and the fault remains owned by the caller. The caller should log the DMA
> > + * protection failure and resolve the fault. Otherwise on success the fault is
> > + * always completed eventually.
>
> About "resolve the fault", I didn't find such logic from smmu side in
> arm_smmu_evtq_thread(). It just logs the event. Is it asking for new
> change in smmu driver or reflecting the current fact which if missing
> leads to the said stall problem?

It was removed in b554e396e51c ("iommu: Make iopf_group_response() return void")

 	ret = iommu_report_device_fault(master->dev, &fault_evt);
-	if (ret && flt->type == IOMMU_FAULT_PAGE_REQ) {
-		/* Nobody cared, abort the access */
-		struct iommu_page_response resp = {
-			.pasid = flt->prm.pasid,
-			.grpid = flt->prm.grpid,
-			.code = IOMMU_PAGE_RESP_FAILURE,
-		};
-		arm_smmu_page_response(master->dev, &fault_evt, &resp);
-	}
-

Part of the observation going into b554e396e51c was that all drivers
have something like the above, and we can pull it into the core code.

So perhaps we should still always abort the request from
iommu_report_device_fault() instead of requiring boilerplate like
the above in drivers. That does seem better.

The return code only indicates if the event should be logged.
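
To make that split concrete, here is a rough standalone model of it. This
is only a sketch, not kernel code: every mock_* name is a made-up stand-in,
and the flags and response codes are simplifications of the real
structures. The core aborts an unhandled page request itself, and the
return value only tells the driver whether to log the event.

```c
#include <stdbool.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel types, for illustration only. */
enum mock_resp_code { MOCK_RESP_NONE, MOCK_RESP_FAILURE };

struct mock_dev {
	bool iopf_enabled;             /* models a non-NULL iopf_param */
	enum mock_resp_code last_resp; /* last response sent to the HW */
};

struct mock_fault {
	bool is_page_req; /* models type == IOMMU_FAULT_PAGE_REQ */
};

/* Hypothetical helper: complete the fault with a failure response. */
static void mock_abort_fault(struct mock_dev *dev)
{
	dev->last_resp = MOCK_RESP_FAILURE;
}

/*
 * Model of the core reporting path. If nobody can handle the fault,
 * the core aborts it here so no driver needs its own abort
 * boilerplate, then returns an error meaning "please log this".
 * On success a handler owns the fault and completes it later.
 */
static int mock_report_device_fault(struct mock_dev *dev,
				    const struct mock_fault *flt)
{
	if (!dev->iopf_enabled) {
		if (flt->is_page_req)
			mock_abort_fault(dev);
		return -EINVAL;
	}
	return 0;
}

/* Model of the driver side: true means the event should be printed. */
static bool mock_handle_evt(struct mock_dev *dev,
			    const struct mock_fault *flt)
{
	return mock_report_device_fault(dev, flt) != 0;
}
```

With iopf disabled, mock_handle_evt() returns true and the fault has
already been completed with a failure response, so in this model the
event would be printed and nothing is left stalled.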

> > /*
> > * On success iopf_handler must call iopf_group_response() and
> >
>
> Now given a return value is required we should also return '0'
> in the following path with a valid iopf_handler.

Yes, that was my intention

Jason