Re: [PATCH 20/24] irqchip/gic-v5: Add GICv5 LPI/IPI support

From: Arnd Bergmann
Date: Wed Apr 09 2025 - 06:57:26 EST


On Wed, Apr 9, 2025, at 12:11, Lorenzo Pieralisi wrote:
> On Wed, Apr 09, 2025 at 10:23:57AM +0200, Arnd Bergmann wrote:
>> On Tue, Apr 8, 2025, at 12:50, Lorenzo Pieralisi wrote:
>> > +static void irs_writeq(struct gicv5_irs_chip_data *irs_data, const u64
>> > val,
>> > + const u64 reg_offset)
>> > +{
>> > + writeq_relaxed(val, irs_data->irs_base + reg_offset);
>> > +}
>>
>> I think the use of _relaxed memory accessors needs some code
>> comments here. The definition of these is that you don't care
>> about ordering relative to DMA master accesses, yet you seem to
>> very much have accesses to the 'ist' from the GIC, as well as
>> DMA accesses from an MSI device, and I would expect both to
>> require ordering.
>
> For the 1-level (linear) IST we allocate it in one go, write
> the base address through relaxed access (that sets the IST
> valid) and poll completion with a relaxed access. Memory is
> cleaned and invalidated from the cache (if the IRS is not
> coherent) before the MMIO sequence above, which implies a
> dsb().
>
> After that memory is handed over to the GIC.
>
> For a 2-level IST, the code that updates L1 entries already add
> a dma_rmb() barrier (ie gicv5_irs_iste_alloc()) to make sure we
> order MMIO wait completion with the subsequent cache invalidate
> (again, in the yet hypothetical case where the IRS is not coherent).
>
> I think I can add comments where the sequence to initialize the
> tables is executed more than here, given that these helpers are
> used for other purposes too.

Usually my recommendation is to have abstractions like this
provide both relaxed and normal variants, and then only
use the relaxed ones where it really matters for performance.

That way you can keep relatively short explanations where
you call irs_writeq_relaxed() and use irs_writeq() without
any code comments any place that doesn't care about saving
a few cycles per call.

>> > +/* Wait for completion of an IST change */
>> > +static int gicv5_irs_ist_wait_for_idle(struct gicv5_irs_chip_data
>> > *irs_data)
>> > +{
>> > + int ret;
>> > + u32 val;
>> > +
>> > + ret = readl_relaxed_poll_timeout_atomic(
>> > + irs_data->irs_base + GICV5_IRS_IST_STATUSR, val,
>> > + FIELD_GET(GICV5_IRS_IST_STATUSR_IDLE, val), 1,
>> > + USEC_PER_SEC);
>> > +
>>
>> What is the significance of the 1 second timeout? This is probably
>> a million times longer than I would expect any hardware interaction
>> to be specified to take. Are you waiting for another thread here?
>
> It is arbitrary, agreed.

Can you make either much shorter, or non-atomic then?

>> > + l2istsz = BIT(n + 1);
>> > + if (l2istsz > KMALLOC_MAX_SIZE) {
>> > + u8 lpi_id_cap = ilog2(KMALLOC_MAX_SIZE) - 2 + istsz;
>> > +
>> > + pr_warn("Limiting LPI ID bits from %u to %u\n",
>> > + lpi_id_bits, lpi_id_cap);
>> > + lpi_id_bits = lpi_id_cap;
>> > + l2istsz = KMALLOC_MAX_SIZE;
>> > + }
>>
>> The use of KMALLOC_MAX_SIZE seem arbitrary here. I remember discussing
>> this in the past and concluding that this is fine for all cases
>> that may be relevant, but it would be good to explain the reasoning
>> in a comment.
>
> We need contiguous physical memory that can be < PAGE_SIZE or larger.
>
> For allocations larger than the allocator caches kmalloc hands over to
> the page allocator, MAX_ORDER is reflected into KMALLOC_MAX_SIZE AFAIU.
>
> That's the reasoning. Does it make sense ?

I'm more worried about what happens when KMALLOC_MAX_SIZE is
really small -- did you show that the allocation is still
going to work, or is this likely to cause runtime problems?

>> > + if (irs_data->flags & IRS_FLAGS_NON_COHERENT)
>> > + dcache_clean_inval_poc((unsigned long)ist,
>> > + (unsigned long)ist + l2istsz);
>> > + else
>> > + dsb(ishst);
>> ...
>> > + baser = (virt_to_phys(ist) & GICV5_IRS_IST_BASER_ADDR_MASK) |
>> > + FIELD_PREP(GICV5_IRS_IST_BASER_VALID, 0x1);
>>
>> Here it seems like you are open-coding the DMA mapping interface
>> details, in particular the mapping of the 'ist' memory area into
>> the gic's DMA master space, the coherency and the barrier that is
>> normally part of a (non-relaxed) writeq(). Is there a reason
>> you can't use the normal interfaces here, using dma_alloc_coherent()
>> or dma_alloc_noncoherent()?
>
> The GIC IRS must be brought up early, it is not a struct device.

Right, that is rather unfortunate.

>> Do you expect actual implementation to not be cache-coherent?
>
> It is allowed by the architecture - I don't have a crystal ball
> but if I want to add support for a non-coherent IRS the DMA mapping
> like sequence above has to be there - alternatives are welcome.

I see that we have a few GICv3 implementations that are marked
as non-coherent in DT. I don't understand why they'd do that,
but I guess there is not much to be done about it.

The only other idea I have would be to use an uncached allocation
for the non-coherent case, the same way that dma_alloc_coherent()
or maybe dma_alloc_wc() does. This still has the same problem
with bypassing the dma-mapping.h interface because of the lack
of a device pointer, but it would at least avoid the cache flushes
at runtime. If I read this code right, the data in here is only
written by the CPU and read by the GIC, so a WC buffer wouldn't
be more expensive, right?

Arnd