On Mon, Apr 7, 2025 at 8:38 PM Robin Murphy <robin.murphy@xxxxxxx> wrote:
On 2025-04-07 6:17 am, Chen-Yu Tsai wrote:
Hi,
On Thu, Apr 3, 2025 at 6:24 PM Louis-Alexis Eyraud
<louisalexis.eyraud@xxxxxxxxxxxxx> wrote:
Currently, mtk_iommu calls during probe iommu_device_register before
the hw_list from driver data is initialized. Since iommu probing issue
fix, it leads to NULL pointer dereference in mtk_iommu_device_group when
hw_list is accessed with list_first_entry (not null safe).
So, change the call order to ensure iommu_device_register is called
after the driver data are initialized.
Fixes: 9e3a2a643653 ("iommu/mediatek: Adapt sharing and non-sharing pgtable case")
Fixes: bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path")
Reviewed-by: Yong Wu <yong.wu@xxxxxxxxxxxx>
Tested-by: Chen-Yu Tsai <wenst@xxxxxxxxxxxx> # MT8183 Juniper, MT8186 Tentacruel
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@xxxxxxxxxxxxx>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@xxxxxxxxxxxxx>
Signed-off-by: Louis-Alexis Eyraud <louisalexis.eyraud@xxxxxxxxxxxxx>
---
This patch fixes a NULL pointer dereference that occurs during the
mtk_iommu driver probe and observed at least on several Mediatek Genio boards:
```
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
This is a reminder to please land this and send to Linus ASAP.
This fixes the v6.15-rc1 kernel on all the MediaTek Chromebook platforms,
except for MT8188, which seems to have another issue in iommu_get_dma_domain()
used from the DRM driver:
Disabling lock debugging due to kernel taint
Unable to handle kernel NULL pointer dereference at virtual
address 0000000000000158
From the offset and the stacktrace code dump, this would appear to be
the dereference of dev->iommu_group->default_domain, when
dev->iommu_group is NULL (and CONFIG_DEBUG_LOCK_ALLOC makes the mutex
really big). Which is a bit weird, as to get into iommu-dma at all in
that state would suggest that whatever device this is has been removed
and had its group torn down again after iommu_setup_dma_ops() has run...
but either way that implies the DRM driver is passing an arbitrary
device to the DMA API without making sure it's actualy valid.
Trying to trace the provenance of dma_dev from mtk_gem_create() back
through the rest of the driver is quite the rabbit-hole, but it seems
like in at least one case it can lead back to an
of_find_device_by_node() in ovl_adaptor_comp_init(), which definitely
looks sufficiently sketchy.
It kind of makes sense since the "display controller" is composed of
many individual hardware blocks. The struct device tied to the DRM
driver is more or less just a place holder. Only the first block,
either the OVL (overlay compositing engine) or RDMA (scanout engine)
accesses memory, so I think it makes sense to use that as the dma_dev.
With some more logs, I did find something else fishy. Here the IOMMU
for the second display pipeline fails to probe:
mtk-iommu 1c028000.iommu: error -EINVAL: Failed to register IOMMU
mtk-iommu 1c028000.iommu: probe with driver mtk-iommu failed with error -22
Then later on, deferred probe times out, and the display pipeline is
brought up regardless:
mediatek-disp-ovl 1c000000.ovl: deferred probe timeout, ignoring dependency
mediatek-disp-ovl 1c000000.ovl: Adding to IOMMU failed: -110
mediatek-disp-rdma 1c002000.rdma: deferred probe timeout, ignoring
dependency
mediatek-disp-rdma 1c002000.rdma: Adding to IOMMU failed: -110
(repeats for all the individual components of the display pipeline)
mediatek-drm mediatek-drm.16.auto: bound 1c000000.ovl (ops
mtk_disp_ovl_component_ops)
mediatek-drm mediatek-drm.16.auto: bound 1c002000.rdma (ops
mtk_disp_rdma_component_ops)
(repeats for all the individual components of the display pipeline)
mediatek-drm mediatek-drm.16.auto: DMA device is 1c000000.ovl
[drm] Initialized mediatek 1.0.0 for mediatek-drm.16.auto on minor 1
And all without a functional IOMMU.
So I think this brings up two more questions:
1. Why is the IOMMU failing to probe?
2. Why is the core code still going the IOMMU DMA alloc path if there
is no usable IOMMU?
I'll look into the first question first. Insights welcome for the second
one.