On Mon, Jan 06, 2025 at 06:46:08PM +0530, Charan Teja Kalla wrote:
On 1/3/2025 9:04 PM, Will Deacon wrote:
On Thu, Jan 02, 2025 at 04:09:46PM +0530, Charan Teja Kalla wrote:
On 12/13/2024 8:34 PM, Charan Teja Kalla wrote:
Race between smmu device probe and iommu device probe is resulting into
wrong DMA ops used for the device.
bus_iommu_probe(P1) of_iommu_configure(P2)
(from iommu_device_register) (from insmod of a driver)
-------------------------- --------------------------
1) probe_iommu_group()
(Fills just dev->iommu_group
under iommu_probe_device_lock)
2) iommu_probe_device():
(acquires the iommu_probe_device_lock,
but since dev->iommu_group is already
set, it just returns without
allocating the domain.)
3) of_dma_configure_id()->
arch_setup_dma_ops() gets called for
this device, but returns with the
error message:
"Failed to set up IOMMU for device;
retaining platform DMA ops"
4) device probe is succeeded where it
can now setup iommu mappings using
wrong ->dma_ops
5) domain will be allocated later
and fills iommu_group->domain.
Step 3) and 4) denies iommu device from using 'iommu_dma_ops'.
This issue does exists on 6.6 because of commit f188056352bc ("iommu:
Avoid locking/unlocking for iommu_probe_device()") without which 2)
returns only after filling the domain under group->mutex, thus no issue.
With commit b5c58b2fdc42("dma-mapping: direct calls for dma-iommu"),
->iommu_dma_ops is removed altogether and iommu api are directly being
called under "use_dma_iommu(): return dev->dma_iommu". Again,
->dma_iommu flag is set only after domain allocation. So, If
there is a sufficient delay between steps 4) and 5), issue is still
exists but that seems very narrow to me.
I am thinking, setup of the domain before returning from step2) is the
way to fix this issue. Can you please help if you have any suggestions?
Signed-off-by: Charan Teja Kalla <quic_charante@xxxxxxxxxxx>
---
drivers/iommu/iommu.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
Just a quick ask if you have any comments for the below. BTW, in
internal tests, when we revert the commit f188056352bc ("iommu: Avoid
locking/unlocking for iommu_probe_device()", it is helping.
I can share more details If the below is not sufficient.
Thanks in advance...
Can you reproduce the issue on 6.13-rc5? We bodged some of the probe
code recently:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=229e6ee43d2a160a1592b83aad620d6027084aad
and a better set of fixes are queued in -next.
Yes, we did try a better patch from Robin[1] and the issue is still
reproducible.
Did you try the -EPROBE_DEFER hack that got merged?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=229e6ee43d2a160a1592b83aad620d6027084aad
Otherwise, you can try the stack queued for 6.14 in iommu/linux.git:
46b3df8eb9bd iommu: Manage driver probe deferral better
fcbd62156742 iommu/arm-smmu-v3: Clean up more on probe failure
97cb1fa02726 iommu/arm-smmu: Retire probe deferral workaround
7d835134d4e1 iommu/arm-smmu: Make instance lookup robust
IIUC, these patches are trying to get valid smmu device, during the
probe of an smmu client device, by traversing proper 'klist_devices'
happening from iommu_init_device(). Please CMIW.
Well, they're fixing a race between the IOMMU probing and its client
devices attaching.
But this bug makes iommu_init_device() to completely gets skipped.
static int __iommu_probe_device(struct device *dev, ...)
{
/* Device is probed already if in a group */
if (dev->iommu_group)
return 0; --> driver call returns from here.
ret = iommu_init_device(dev, ops);
if (ret)
return ret;
}
Regarding the testing, we work on Android that is on LTS kernels, so, it
may not be possible for us to try the thing on 6.13 and this seems a bug
on LTS 6.6 kernel.
We need to fix 6.13 before we fix 6.6 unless you can show that 6.13 is
unaffected (in which case, we can backport the fix(es)).