Re: [PATCH v7 6/6] drm/msm: iommu: Replace runtime calls with runtime suppliers
From: Tomasz Figa
Date: Thu Feb 22 2018 - 03:14:29 EST
On Fri, Feb 16, 2018 at 9:13 AM, Tomasz Figa <tfiga@xxxxxxxxxxxx> wrote:
> On Fri, Feb 16, 2018 at 2:14 AM, Robin Murphy <robin.murphy@xxxxxxx> wrote:
>> On 15/02/18 04:17, Tomasz Figa wrote:
>> [...]
>>>>
>>>> Could you elaborate on what kind of locking you are concerned about?
>>>> As I explained before, the normally happening fast path would lock
>>>> dev->power_lock only for the brief moment of incrementing the runtime
>>>> PM usage counter.
>>>
>>>
>>> My bad, that's not even it.
>>>
>>> The atomic usage counter is incremented beforehands, without any
>>> locking [1] and the spinlock is acquired only for the sake of
>>> validating that device's runtime PM state remained valid indeed [2],
>>> which would be the case in the fast path of the same driver doing two
>>> mappings in parallel, with the master powered on (and so the SMMU,
>>> through device links; if master was not powered on already, powering
>>> on the SMMU is unavoidable anyway and it would add much more latency
>>> than the spinlock itself).
>>
>>
>> We now have no locking at all in the map path, and only a per-domain lock
>> around TLB sync in unmap which is unfortunately necessary for correctness;
>> the latter isn't too terrible, since in "serious" hardware it should only be
>> serialising a few cpus serving the same device against each other (e.g. for
>> multiple queues on a single NIC).
>>
>> Putting in a global lock which serialises *all* concurrent map and unmap
>> calls for *all* unrelated devices makes things worse. Period. Even if the
>> lock itself were held for the minimum possible time, i.e. trivially
>> "spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing that
>> one cache line around between 96 CPUs across two sockets is not negligible.
>
> Fair enough. Note that we're in a quite interesting situation now:
> a) We need to have runtime PM enabled on Qualcomm SoC to have power
> properly managed,
> b) We need to have lock-free map/unmap on such distributed systems,
> c) If runtime PM is enabled, we need to call into runtime PM from any
> code that does hardware accesses, otherwise the IOMMU API (and so DMA
> API and then any V4L2 driver) becomes unusable.
>
> I can see one more way that could potentially let us have all the
> three. How about enabling runtime PM only on selected implementations
> (e.g. qcom,smmu) and then having all the runtime PM calls surrounded
> with if (pm_runtime_enabled()), which is lockless?
>
Sorry for pinging, but any opinion on this kind of approach?
Best regards,
Tomasz
>>
>>> [1]
>>> http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L1028
>>> [2]
>>> http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L613
>>>
>>> In any case, I can't imagine this working with V4L2 or anything else
>>> relying on any memory management more generic than calling IOMMU API
>>> directly from the driver, with the IOMMU device having runtime PM
>>> enabled, but without managing the runtime PM from the IOMMU driver's
>>> callbacks that need access to the hardware. As I mentioned before,
>>> only the IOMMU driver knows when exactly the real hardware access
>>> needs to be done (e.g. Rockchip/Exynos don't need to do that for
>>> map/unmap if the power is down, but some implementations of SMMU with
>>> TLB powered separately might need to do so).
>>
>>
>> It's worth noting that Exynos and Rockchip are relatively small
>> self-contained IP blocks integrated closely with the interfaces of their
>> relevant master devices; SMMU is an architecture, implementations of which
>> may be large, distributed, and have complex and wildly differing internal
>> topologies. As such, it's a lot harder to make hardware-specific assumptions
>> and/or be correct for all possible cases.
>>
>> Don't get me wrong, I do ultimately agree that the IOMMU driver is the only
>> agent who ultimately knows what calls are going to be necessary for whatever
>> operation it's performing on its own hardware*; it's just that for SMMU it
>> needs to be implemented in a way that has zero impact on the cases where it
>> doesn't matter, because it's not viable to specialise that driver for any
>> particular IP implementation/use-case.
>
> Still, exactly the same holds for the low power embedded use cases,
> where we strive for the lowest possible power consumption, while
> maintaining performance levels high as well. And so the SMMU code is
> expected to also work with our use cases, such as V4L2 or DRM drivers.
> Since these points don't hold for current SMMU code, I could say that
> the it has been already specialized for large, distributed
> implementations.
>
> Best regards,
> Tomasz