Re: [PATCH] iommu/io-pgtable-arm: Add support for contiguous hint bit

From: Daniel Mentz

Date: Fri Jun 19 2026 - 15:41:15 EST

On Thu, Jun 18, 2026 at 2:06 AM Vijayanand Jitta
<vijayanand.jitta@xxxxxxxxxxxxxxxx> wrote:
> Support is gated behind CONFIG_IOMMU_IO_PGTABLE_CONTIG_HINT, which
> provides a compile-time opt-out for hardware affected by SMMU errata
> related to the contiguous bit.

Have you considered making this a runtime option? Compare this with
arm_smmu_device_iidr_probe() where the smmuv3 driver disables certain
features based on the identified implementation and the errata
affecting that implementation.

> On the mapping side, __arm_lpae_map() detects when the requested size
> matches a contiguous range at the next level, sets the CONT bit on all
> PTEs in the group, then recurses with the base block size and an
> adjusted pgcount.

I would perform this check at the current level not the previous
level. See comments below.

>
> On the unmapping side, the CONT bit is cleared from all PTEs in the
> affected contiguous group before any individual entry is invalidated,
> following the Break-Before-Make requirement of the architecture.

My understanding is that for unmap operations, the following rule applies:

The IOVA range targeted by an unmap operation must exactly match the
IOVA range of a previous map operation. Partial unmap operations are
not allowed.

The iopgtable code previously had a function named
arm_lpae_split_blk_unmap() which allowed a block mapping to be split
up. However, that function has since been removed, which aligns with
prohibiting partial unmaps.
The other concern I have is a potential race condition: While one
thread clears the contiguous bit, another thread could try to unmap
the same descriptor.

Consider dropping support for partial unmap and just triggering a
WARN_ON() if you detect that a contiguous group is partially unmapped.

> +static inline int arm_lpae_cont_pmds(unsigned long size)

PMD is not a term that is used in this file. I advise against
introducing this term.

> +static u32 arm_lpae_find_num_cont(struct arm_lpae_io_pgtable *data, int lvl)
> +{
> + if (lvl == ARM_LPAE_MAX_LEVELS - 2)
> + return arm_lpae_cont_pmds(ARM_LPAE_BLOCK_SIZE(lvl, data));
> + else if (lvl == ARM_LPAE_MAX_LEVELS - 1)
> + return arm_lpae_cont_ptes(ARM_LPAE_BLOCK_SIZE(lvl, data));

Consider supporting the contiguous bit at lookup level 1.

> static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> phys_addr_t paddr, size_t size, size_t pgcount,
> arm_lpae_iopte prot, int lvl, arm_lpae_iopte *ptep,
> @@ -463,6 +583,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> size_t tblsz = ARM_LPAE_GRANULE(data);
> struct io_pgtable_cfg *cfg = &data->iop.cfg;
> int ret = 0, num_entries, max_entries, map_idx_start;
> + u32 num_cont = 1;
>
> /* Find our entry at the current level */
> map_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
> @@ -505,6 +626,24 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> return -EEXIST;
> }
>
> + if (arm_lpae_pte_is_contiguous_range(data, size, lvl + 1, &num_cont)) {

I would recommend performing this check at the actual level not at the
previous lookup level i.e. not at the (lvl - 1) level. Imagine the
following situation: The granule size is 4KB, the initial lookup level
is 2, and size is 32MB. I'm wondering if in that case, it'll just keep
recursing until it hits (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1)).

> +#ifdef CONFIG_IOMMU_IO_PGTABLE_CONTIG_HINT
> +static void arm_lpae_cont_clear(struct arm_lpae_io_pgtable *data,
> + unsigned long iova, int lvl,
> + arm_lpae_iopte *ptep, size_t num_entries)
> +{
> + struct io_pgtable_cfg *cfg = &data->iop.cfg;
> + u32 num_cont = arm_lpae_find_num_cont(data, lvl);
> + arm_lpae_iopte *cont_ptep;
> + arm_lpae_iopte *cont_ptep_start;
> + unsigned long cont_iova;
> + int offset, itr;
> +
> + cont_ptep = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
> + cont_iova = round_down(iova,
> + ARM_LPAE_BLOCK_SIZE(lvl, data) * num_cont);

As a result of this round_down() function, you are accessing a
descriptor that describes an IOVA outside the range targeted by the
iommu_unmap call. Consequently, you might race against another thread
accessing the same descriptor.

> + cont_ptep += ARM_LPAE_LVL_IDX(cont_iova, lvl, data);
> + cont_ptep_start = cont_ptep;
> +
> + /*
> + * iova may not be aligned to the contiguous group boundary; include
> + * any leading entries so round_up() covers all overlapping groups.
> + */
> + offset = ARM_LPAE_LVL_IDX(iova, lvl, data) -
> + ARM_LPAE_LVL_IDX(cont_iova, lvl, data);
> + num_entries = round_up(offset + num_entries, num_cont);
> +
> + for (itr = 0; itr < num_entries; itr++) {
> + WRITE_ONCE(*cont_ptep, READ_ONCE(*cont_ptep) & ~ARM_LPAE_PTE_CONT);

This read-modify-write operation is not safe due to the potential race
described above.

> + cont_ptep++;
> + }
> +
> + if (!cfg->coherent_walk)
> + __arm_lpae_sync_pte(cont_ptep_start, num_entries, cfg);
> +}
> +#else
> +static void arm_lpae_cont_clear(struct arm_lpae_io_pgtable *data,
> + unsigned long iova, int lvl,
> + arm_lpae_iopte *ptep, size_t num_entries)
> +{
> +}
> +#endif
> +
> static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> struct iommu_iotlb_gather *gather,
> unsigned long iova, size_t size, size_t pgcount,
> @@ -660,7 +841,7 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> {
> arm_lpae_iopte pte;
> struct io_pgtable *iop = &data->iop;
> - int i = 0, num_entries, max_entries, unmap_idx_start;
> + int i = 0, num_cont = 1, num_entries, max_entries, unmap_idx_start;
>
> /* Something went horribly wrong and we ran out of page table */
> if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
> @@ -675,9 +856,15 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> }
>
> /* If the size matches this level, we're in the right place */
> - if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
> + if (size == ARM_LPAE_BLOCK_SIZE(lvl, data) ||
> + (size == arm_lpae_find_num_cont(data, lvl) *
> + ARM_LPAE_BLOCK_SIZE(lvl, data))) {
> + size_t pte_size;
> +
> max_entries = arm_lpae_max_entries(unmap_idx_start, data);
> - num_entries = min_t(int, pgcount, max_entries);
> + num_cont = arm_lpae_check_num_cont(data, size, lvl);
> + num_entries = min_t(int, num_cont * pgcount, max_entries);
> + pte_size = size / num_cont;
>
> /* Find and handle non-leaf entries */
> for (i = 0; i < num_entries; i++) {
> @@ -687,11 +874,27 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> break;
> }
>
> + /*
> + * Break-Before-Make: before invalidating any leaf
> + * entry, clear the CONT bit from every entry in the
> + * contiguous group(s) and flush the TLB, as required
> + * by the architecture. arm_lpae_cont_clear() covers
> + * the full [iova, iova + num_entries * pte_size) range
> + * via round_up(), so subsequent entries read back
> + * CONT=0 and skip this block.
> + */
> + if (pte & ARM_LPAE_PTE_CONT) {
> + arm_lpae_cont_clear(data, iova, lvl, ptep, num_entries);
> + io_pgtable_tlb_flush_walk(iop, iova,
> + num_entries * pte_size,
> + ARM_LPAE_GRANULE(data));

I believe this is inefficient. Consider the case where we unmap 2MB
worth of IOVA space mapped by 512 4KB page descriptors with the
contiguous bit set. If I'm not mistaken, you're running CMOs
(__arm_lpae_sync_pte) twice for every page descriptor. In addition,
io_pgtable_tlb_flush_walk() will submit an extra CMD_SYNC and wait for
it's completion.

Additionally, you perform rounding in arm_lpae_cont_clear(). However,
io_pgtable_tlb_flush_walk() is called on the original, potentially
unaligned range. Can this lead to under invalidation? Again, my
preference would be to drop support for partial unmaps which would
also remove the requirement for calling io_pgtable_tlb_flush_walk()
here.

> + }
> +
> if (!iopte_leaf(pte, lvl, iop->fmt)) {
> __arm_lpae_clear_pte(&ptep[i], &iop->cfg, 1);
>
> /* Also flush any partial walks */
> - io_pgtable_tlb_flush_walk(iop, iova + i * size, size,
> + io_pgtable_tlb_flush_walk(iop, iova + i * pte_size, pte_size,
> ARM_LPAE_GRANULE(data));
> __arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
> }