Re: sa8775p-ride: What's a normal SMMU TLB sync time?

From: Bjorn Andersson
Date: Thu Apr 04 2024 - 23:26:37 EST


On Tue, Apr 02, 2024 at 04:22:31PM -0500, Andrew Halaney wrote:
> Hey,
>
> Sorry for the wide email, but I figured someone recently contributing
> to / maintaining the Qualcomm SMMU driver may have some proper insights
> into this.
>
> Recently I remembered that performance on some Qualcomm platforms
> takes a major hit when you use iommu.strict=1/CONFIG_IOMMU_DEFAULT_DMA_STRICT.
>
> On the sa8775p-ride, I see most TLB sync calls to be about 150 us long,
> with some spiking to 500 us, etc:
>
> [root@qti-snapdragon-ride4-sa8775p-09 ~]# trace-cmd start -p function_graph -g qcom_smmu_tlb_sync --max-graph-depth 1
> plugin 'function_graph'
> [root@qti-snapdragon-ride4-sa8775p-09 ~]# trace-cmd show
> # tracer: function_graph
> #
> # CPU  DURATION                  FUNCTION CALLS
> # |     |   |                     |   |   |   |
>  0) ! 144.062 us  |  qcom_smmu_tlb_sync();
>
> On my sc8280xp-lenovo-thinkpad-x13s (only other Qualcomm platform I can compare
> with) I see around 2-15 us with spikes up to 20-30 us. That's thanks to this
> patch[0], which I guess improved the platform from 1-2 ms to the ~10 us number.
>
> It's not entirely clear to me how DPU-specific programming affects system-wide
> SMMU performance, but I'm curious if this is the only way to achieve this?
> sa8775p doesn't have the DPU described even right now, so that's a bummer
> as there's no way to make a similar immediate optimization, but I'm still struggling
> to understand what that patch really did to improve things so maybe I'm missing
> something.
>

The cause was that the TLB sync is synchronized with the display updates,
but without appropriate safe_lut_tbl values the display side wouldn't
play nice.

Regards,
Bjorn

> I'm honestly not even sure what a "typical" range for TLB sync time would be,
> but on sa8775p-ride it's bad enough that some IRQs like UFS can cause RCU stalls
> (pretty easy to reproduce with fio basic-verify.fio for example on the platform).
> It also makes running with iommu.strict=1 impractical as performance for UFS,
> ethernet, etc drops 75-80%.
>
> Does anyone have any bright ideas on how to improve this, or if I'm even in
> the right for assuming that time is suspiciously long?
>
> Thanks,
> Andrew
>
> [0] https://lore.kernel.org/linux-arm-msm/CAF6AEGs9PLiCZdJ-g42-bE6f9yMR6cMyKRdWOY5m799vF9o4SQ@xxxxxxxxxxxxxx/
>
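
For comparing sync times across platforms, the function_graph output quoted
above can be summarized rather than eyeballed. The following is only a rough
sketch, assuming the usual "CPU) [marker] <duration> us | func();" leaf-line
layout that --max-graph-depth 1 produces. The helper names sync_durations and
summarize are made up for illustration and are not part of trace-cmd or the
kernel.

```python
import re
import statistics

# Matches leaf lines of function_graph output such as:
#   " 0) ! 144.062 us  |  qcom_smmu_tlb_sync();"
# The optional single character before the duration is the tracer's
# "long duration" marker (for example '+' or '!').
LINE_RE = re.compile(
    r"^\s*\d+\)\s+(?:[+!#*@$]\s+)?(?P<us>\d+(?:\.\d+)?)\s+us\s+\|\s+"
    r"(?P<func>\w+)\(\);"
)


def sync_durations(lines, func="qcom_smmu_tlb_sync"):
    """Return the durations (in microseconds) recorded for `func`."""
    durations = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("func") == func:
            durations.append(float(match.group("us")))
    return durations


def summarize(durations):
    """Return count, median and max of the durations, or None if empty."""
    if not durations:
        return None
    return {
        "count": len(durations),
        "median_us": statistics.median(durations),
        "max_us": max(durations),
    }
```

One way to use it would be to save the output of "trace-cmd show" to a file
and pass the open file object to sync_durations(), then print the result of
summarize(). Header and comment lines are skipped because they do not match
the pattern.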