Re: [PATCH] iommu/rockchip: fix page table allocation flags for v2 IOMMU

From: Midgy Balon

Date: Fri Apr 03 2026 - 10:07:43 EST


From: Midgy BALON <midgy971@xxxxxxxxx>
To: Simon Xue <xxm@xxxxxxxxxxxxxx>
Cc: Jonas Karlman <jonas@xxxxxxxxx>, iommu@xxxxxxxxxxxxxxx,
joro@xxxxxxxxxx, will@xxxxxxxxxx, robin.murphy@xxxxxxx,
heiko@xxxxxxxxx, linux-arm-kernel@xxxxxxxxxxxxxxxxxxx,
linux-rockchip@xxxxxxxxxxxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx,
stable@xxxxxxxxxxxxxxx
In-Reply-To: <5663593b-2c53-4632-ad2c-db9efa8e9ab2@xxxxxxxxxxxxxx>
References: <20260331075010.1463-1-midgy971@xxxxxxxxx>
<0f285782-b12a-4abd-bca7-b6c549bed59f@xxxxxxxxxxxxxx>
<e622cc9e-8fb0-454a-b88e-dc13cf2ff507@xxxxxxxxx>
<89ed223d-1a2c-447d-9f21-76969e668855@xxxxxxxxxxxxxx>
<5663593b-2c53-4632-ad2c-db9efa8e9ab2@xxxxxxxxxxxxxx>

On 4/3/2026, Simon Xue wrote:
> We internally checked that the RK356x SoCs integrate two different
> IOMMU versions (v1.0 and v2.0); for example, the NPU and ISP use the
> v1.0 IOMMU.
>
> Both versions can map 40-bit physical pages, but v1.0 does not support
> placing the first-level page table above 4 GB.
>
> To fix this, I think we need to land this patch first:
> https://lore.kernel.org/all/20260310105303.128859-1-xxm@xxxxxxxxxxxxxx/
>
> Then on top of that, we can add a new compatible string to distinguish
> the IOMMU versions.

Thank you, Simon and Jonas, for the internal investigation. This explains
exactly what I observed.

To answer Simon's earlier question: the IP block hitting both failure
modes is the NPU IOMMU (rknpu_mmu, at 0xfde4b000), currently bound
to "rockchip,rk3568-iommu" in rk356x-base.dtsi. Both the downstream
rknpu driver and the upstream Rocket accel driver (drivers/accel/rocket/)
use this IOMMU.

The v1.0 first-level page table constraint explains both failure modes.
On boards with more than 4 GB of RAM the DTE table can be allocated
above 0x100000000, and the v1.0 hardware silently truncates or errors
on that address. The SWIOTLB bounce-buffer path is a consequence of
the same root cause: with DMA_BIT_MASK(32) on the NPU device, bounce
buffers land below 4 GB, phys_to_virt() of the bounce address is then
used as the PTE write target, and the subsequent
dma_sync_single_for_device(DMA_TO_DEVICE) overwrites those PTEs with
zeros from the original buffer.
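
To make that sequence concrete, here is a minimal userspace model; it is
not driver code. The arrays orig_table and bounce_slot and the helpers
copy_orig_to_bounce() and simulate_bounced_pte_write() are hypothetical
stand-ins for the page-table page, its SWIOTLB slot, and the copies done
by dma_map_single() and dma_sync_single_for_device(DMA_TO_DEVICE). It
assumes the mapping was bounced and that the PTE pointer was derived from
the bounce address, as described above.

```c
#include <stdint.h>
#include <string.h>

#define NUM_PTES 4

/* Hypothetical stand-ins: the page-table page as allocated (outside the
 * DMA mask) and the SWIOTLB slot it was bounced into (below 4 GB). */
static uint32_t orig_table[NUM_PTES];
static uint32_t bounce_slot[NUM_PTES];

/* Models the bounce copy done by dma_map_single() and by
 * dma_sync_single_for_device() with DMA_TO_DEVICE: the CPU-side
 * (original) buffer is copied into the bounce slot. */
static void copy_orig_to_bounce(void)
{
	memcpy(bounce_slot, orig_table, sizeof(bounce_slot));
}

/* Returns the value the IOMMU would read for PTE 0 after the sequence:
 * map, write PTE through the bounce address, sync for device. */
uint32_t simulate_bounced_pte_write(uint32_t pte)
{
	memset(orig_table, 0, sizeof(orig_table)); /* freshly zeroed table */
	copy_orig_to_bounce();                     /* dma_map_single() */

	/* The table pointer comes from the DTE, which holds the bounce
	 * address, so the PTE lands in bounce_slot, not in orig_table. */
	bounce_slot[0] = pte;

	copy_orig_to_bounce(); /* dma_sync_single_for_device() */
	return bounce_slot[0]; /* zero: the PTE has been overwritten */
}
```

Whatever PTE value is passed in, the model returns 0, which matches the
"every PTE reads as zero" fault.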

Please consider my original patch withdrawn. Modifying iommu_data_ops_v2
was too broad and would have incorrectly constrained VOP2 MMU and all
other v2 IOMMU users.

I agree fully with the two-step approach. On top of your per-device-ops
patch [1], I plan to send:

[1/2] iommu/rockchip: Add "rockchip,rk3568-iommu-v1" compatible
      for IOMMU v1.0 blocks (NPU, ISP/VICAP) on RK3568:
      ops with .gfp_flags = GFP_DMA32,
      .dma_bit_mask = DMA_BIT_MASK(40)
      (v1.0 can still map 40-bit physical pages; only the DTE
      table base must be below 4 GB)
[2/2] arm64: dts: rockchip: rk356x: Use "rockchip,rk3568-iommu-v1"
      for rknpu_mmu (0xfde4b000) and vicap_mmu (0xfdfe0800)
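
For reference, the compatible-to-ops selection that [1/2] would add can be
sketched as a small userspace model. struct rk_ops_model, rk_compat_table
and rk_ops_lookup() are hypothetical names: the real driver uses struct
rk_iommu_ops with a gfp_t gfp_flags field and matches through its
of_device_id table, and the "rockchip,rk3568-iommu-v1" entry only reflects
the proposal above, not merged code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Equivalent to the kernel's DMA_BIT_MASK() for 0 < n <= 64. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical model of the two rk_iommu_ops fields discussed here; the
 * real structure also carries the DTE/PTE encoding callbacks. */
struct rk_ops_model {
	uint64_t dma_bit_mask; /* DMA mask applied to the IOMMU device */
	bool gfp_dma32;        /* true if tables are allocated with GFP_DMA32 */
};

struct compat_entry {
	const char *compatible;
	struct rk_ops_model ops;
};

static const struct compat_entry rk_compat_table[] = {
	/* Existing mainline entries (iommu_data_ops_v1 / _v2). */
	{ "rockchip,iommu",           { DMA_BIT_MASK(32), true  } },
	{ "rockchip,rk3568-iommu",    { DMA_BIT_MASK(40), false } },
	/* Proposed: v1.0 blocks keep 40-bit PTEs but need the DTE below 4 GB. */
	{ "rockchip,rk3568-iommu-v1", { DMA_BIT_MASK(40), true  } },
};

/* Returns the ops for a compatible string, or NULL if it is unknown. */
const struct rk_ops_model *rk_ops_lookup(const char *compatible)
{
	size_t n = sizeof(rk_compat_table) / sizeof(rk_compat_table[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(rk_compat_table[i].compatible, compatible) == 0)
			return &rk_compat_table[i].ops;
	return NULL;
}
```

Keeping the 40-bit mask while forcing GFP_DMA32 means only where the tables
live changes; the range of pages the v1.0 block can map stays the same.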

One note on the SWIOTLB issue: with GFP_DMA32 in the new ops, page
tables are always allocated below 4 GB, so mapping them never needs a
SWIOTLB bounce buffer. The "track L2 base addresses" approach you
suggested should therefore not be necessary, since GFP_DMA32 prevents
the bounce-buffer poisoning at the source. Happy to be corrected if
there is another path where it is still needed.
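
The reasoning rests on the dma-direct capability check: a buffer is bounced
only when its last byte is not addressable under the device's DMA mask. A
simplified model, assuming a 1:1 physical-to-DMA mapping and ignoring
bus_dma_limit and forced bouncing; needs_bounce() and is_dma32_allocation()
are hypothetical helpers, not kernel APIs:

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))
#define SZ_4G 0x100000000ULL

/* Simplified capability test (assumes size > 0 and phys == DMA address):
 * if the last byte lies above the mask, dma_map_single() would fall back
 * to a SWIOTLB bounce buffer. */
bool needs_bounce(uint64_t phys, uint64_t size, uint64_t dma_mask)
{
	uint64_t end = phys + size - 1;

	return end > dma_mask;
}

/* A GFP_DMA32 allocation lies entirely below 4 GB. */
bool is_dma32_allocation(uint64_t phys, uint64_t size)
{
	return phys + size <= SZ_4G;
}
```

Any table satisfying is_dma32_allocation() passes the check under both a
32-bit and a 40-bit mask, while a table at 0x100000000 is only bounced when
the mask is 32 bits.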

I am happy to add Tested-by to your per-device-ops patch [1].

[1] https://lore.kernel.org/all/20260310105303.128859-1-xxm@xxxxxxxxxxxxxx/

Regards,
Midgy BALON

On Fri, Apr 3, 2026 at 06:40, Simon Xue <xxm@xxxxxxxxxxxxxx> wrote:
>
>
> On 2026/4/1 18:22, Simon Xue wrote:
> > Hi Jonas,
> >
> > On 2026/4/1 16:41, Jonas Karlman wrote:
> >> Hi Simon,
> >>
> >> On 4/1/2026 9:48 AM, Simon wrote:
> >>> Hi Midgy,
> >>>
> >>> On 2026/3/31 15:50, Midgy BALON wrote:
> >>>> commit 2a7e6400f72b ("iommu: rockchip: Allocate tables from all
> >>>> available memory for IOMMU v2") removed GFP_DMA32 from
> >>>> iommu_data_ops_v2, reasoning that RK356x and RK3588 IOMMU v2 hardware
> >>>> supports up to 40-bit physical addresses for page tables. However, the
> >>>> RK3568 IOMMU page-table walker uses a 32-bit AXI bus: it cannot access
> >>>> physical addresses above 4 GB regardless of the address encoding
> >>>> range.
> >>>>
> >>>> On boards with more than 4 GB of RAM (e.g. 8 GB LPDDR4X), removing
> >>>> GFP_DMA32 causes two distinct failure modes:
> >>>>
> >>>> 1. Direct allocation above 4 GB: iommu_alloc_pages_sz() may return
> >>>>    memory above 0x100000000. The hardware page-table walker issues a
> >>>>    bus error trying to dereference those addresses, causing an IOMMU
> >>>>    fault on the first DMA transaction.
> >>> Which IP block is hitting this? We'd like to take a look on our end.
> >> I have seen reports that the NPU MMU on RK3568/RK3566 is having some
> >> issue using DTE/PTE with >32-bit addresses, maybe it uses a different
> >> MMU hw revision or has some hw errata?
> >>
> >> From my own testing at least the VOP2 MMU on RK3568 (and RK3588) was
> >> able to handle 40-bit addressable DTE/PTE, hence the original commit
> >> 2a7e6400f72b ("iommu: rockchip: Allocate tables from all available
> >> memory for IOMMU v2").
> >>
> >> As also mentioned in my reply at [1], maybe the NPU MMU has some hw
> >> limitation or errata and may need to use a different compatible.
> >
> > Yes, we are checking internally whether different IOMMU versions
> > are integrated.
> >
> > I will share what we find once we have results.
> >
> We internally checked that the RK356x SoCs integrate two different IOMMU
> versions (v1.0 and v2.0); for example, the NPU and ISP use the v1.0 IOMMU.
>
> Both versions can map 40-bit physical pages, but v1.0 does not support
> placing the first-level page table above 4 GB.
>
> To fix this, I think we need to land this patch first:
> https://lore.kernel.org/all/20260310105303.128859-1-xxm@xxxxxxxxxxxxxx/
>
> Then on top of that, we can add a new compatible string to distinguish
> the IOMMU versions.
>
> >> [1]
> >> https://lore.kernel.org/r/3cd63b3d-1c5e-4a11-856e-c4aeb5d97d55@xxxxxxxxx/
> >>
> >> Regards,
> >> Jonas
> >>
> >>>> 2. SWIOTLB bounce-buffer poisoning: without GFP_DMA32, page tables land
> >>>>    above the SWIOTLB window. dma_map_single() with DMA_BIT_MASK(32)
> >>>>    then bounces them into a buffer below 4 GB. rk_dte_get_page_table()
> >>>>    returns phys_to_virt() of the bounce buffer address; PTEs are written
> >>>>    there; the next dma_sync_single_for_device(DMA_TO_DEVICE) copies the
> >>>>    original (zero) data back over the bounce buffer, silently erasing
> >>>>    the freshly written PTEs. The IOMMU faults because every PTE reads
> >>>>    as zero.
> >>> This probably needs a separate patch. One way to fix it would be to
> >>> track the original L2 page table base addresses in struct
> >>> rk_iommu_domain, then have rk_dte_get_page_table() return the tracked
> >>> address instead of deriving it from the DTE.
> >>>
> >>>> Restore GFP_DMA32 (and DMA_BIT_MASK(32)) for iommu_data_ops_v2, which
> >>>> currently only serves "rockchip,rk3568-iommu" in mainline.
> >>>>
> >>>> Tested on Radxa ROCK 3B (RK3568, 8 GB LPDDR4X):
> >>>> - MobileNetV1 via RKNN: 5.8 ms/inference (IOMMU mode)
> >>>> - YOLOv5s 640x640 via RKNN: ~57 ms/inference (IOMMU mode)
> >>>> - No IOMMU faults, correct inference results
> >>>>
> >>>> Fixes: 2a7e6400f72b ("iommu: rockchip: Allocate tables from all available memory for IOMMU v2")
> >>>> Cc: stable@xxxxxxxxxxxxxxx
> >>>> Cc: Jonas Karlman <jonas@xxxxxxxxx>
> >>>> Signed-off-by: Midgy BALON <midgy971@xxxxxxxxx>
> >>>> ---
> >>>> drivers/iommu/rockchip-iommu.c | 4 ++--
> >>>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c
> >>>> index 85f3667e797..8b45db29471 100644
> >>>> --- a/drivers/iommu/rockchip-iommu.c
> >>>> +++ b/drivers/iommu/rockchip-iommu.c
> >>>> @@ -1358,8 +1358,8 @@ static struct rk_iommu_ops iommu_data_ops_v2 = {
> >>>>  	.pt_address = &rk_dte_pt_address_v2,
> >>>>  	.mk_dtentries = &rk_mk_dte_v2,
> >>>>  	.mk_ptentries = &rk_mk_pte_v2,
> >>>> -	.dma_bit_mask = DMA_BIT_MASK(40),
> >>>> -	.gfp_flags = 0,
> >>>> +	.dma_bit_mask = DMA_BIT_MASK(32),
> >>>> +	.gfp_flags = GFP_DMA32,
> >>>>  };
> >>>>  
> >>>>  static const struct of_device_id rk_iommu_dt_ids[] = {
> >>