I'm looking at this code after these patches are applied and it still
seems quite bonkers to me 🙁
Why do we allocate two copies of the memory on all fault paths?
Why do we have fault->type still that only has one value?
What is serializing iommu_get_domain_for_dev_pasid() in the fault
path? It looks sort of like the plan is to use iopf_param->lock and
ensure domain removal grabs that lock at least after the xarray is
changed - but does that actually happen?
I would suggest, broadly, a flow for iommu_report_device_fault() sort
of:
1) Allocate memory for the evt. Every path except errors needs this,
so just do it
2) iopf_get_dev_fault_param() should not have locks in it! This is
fast path now. Use a refcount, atomic compare exchange to allocate,
and RCU free.
3) Everything runs under the fault_param->lock
4) Check if !IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE, set it aside and then
exit! This logic is really tortured and confusing
5) Allocate memory and assemble the group
6) Obtain the domain for this group and incr a per-domain counter that a
fault is pending on that domain
7) Put the*group* into the WQ. Put the*group* on a list in fault_param
instead of the individual faults
8) Don't linear search a linked list in iommu_page_response()! Pass
the group in that we got from the WQ that we*know* is still
active. Ack that passed group.
When freeing a domain wait for the per-domain counter to go to
zero. This ensures that the WQ is flushed out and all the outside
domain references are gone.
When wanting to turn off PRI make sure a non-PRI domain is
attached to everything. Fence against the HW's event queue. No new
iommu_report_device_fault() is possible.
Lock the fault_param->lock and go through every pending group and
respond it. Mark the group memory as invalid so iommu_page_response()
NOP's it. Unlock, fence the HW against queued responses, and turn off
PRI.
An*optimization* would be to lightly flush the domain when changing
the translation. Lock the fault_param->lock and look for groups in the
list with old_domain. Do the same as for PRI-off: respond to the
group, mark it as NOP. The WQ may still be chewing on something so the
domain free still has to check and wait.