Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency

From: John Garry
Date: Thu Jul 16 2020 - 07:32:22 EST


On 16/07/2020 12:22, Robin Murphy wrote:
On 2020-07-16 11:56, John Garry wrote:
On 16/07/2020 11:28, Will Deacon wrote:
On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
As mentioned in [0], the CPU may consume many cycles processing
arm_smmu_cmdq_issue_cmdlist(). One issue we find is that the cmpxchg()
loop to get space on the queue takes approx. 25% of the cycles for this
function.

This series removes that cmpxchg().
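
For context, the pattern under discussion is a compare-and-swap retry
loop that each CPU spins in to claim its slots on the shared queue. A
minimal standalone sketch of that pattern, using C11 atomics and
illustrative names rather than the actual driver code, looks roughly
like this:

	#include <stdatomic.h>
	#include <stdint.h>

	#define Q_SIZE 256u			/* illustrative queue depth */

	_Atomic uint32_t q_prod;		/* software copy of the producer index */

	/* Claim 'n' slots; returns the index of the first slot claimed. */
	uint32_t claim_slots_cas(uint32_t n)
	{
		uint32_t old, new;

		old = atomic_load_explicit(&q_prod, memory_order_relaxed);
		do {
			/* A real queue would also check for free space here. */
			new = (old + n) % Q_SIZE;
		} while (!atomic_compare_exchange_weak_explicit(&q_prod, &old, new,
								memory_order_relaxed,
								memory_order_relaxed));
		return old;
	}

Under contention every failed compare-exchange reloads the index and
retries, which is where the cycles quoted above go.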

How about something much simpler like the diff below?
Ah, scratch that, I don't drop the lock if we fail the CAS with it held.
Let me hack it some more (I have no hardware so I can only build-test
this).

Right, second attempt...

I can try it, but if performance is not as good, then please check mine
further (patch 4/4 specifically) - performance is really good, IMHO.

Perhaps a silly question (I'm too engrossed in PMU world ATM to get
properly back up to speed on this), but couldn't this be done without
cmpxchg anyway? Instinctively it feels like instead of maintaining a
literal software copy of the prod value, we could resolve the "claim my
slot in the queue" part with atomic_fetch_add on a free-running 32-bit
"pseudo-prod" index, then whoever updates the hardware deals with the
truncation and wrap bit to convert it to an actual register value.
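
To make that concrete, here is a minimal sketch of the suggested scheme,
again with C11 atomics and illustrative names rather than the actual
driver code (the register layout, with the wrap bit sitting just above
the index bits, is an assumption for the example): claiming slots
becomes a single fetch_add on a free-running counter, and truncation
plus the wrap bit are only worked out by whoever writes the register.

	#include <stdatomic.h>
	#include <stdint.h>

	#define Q_LOG2SIZE	8u		/* 256-entry queue, illustrative */
	#define Q_SIZE		(1u << Q_LOG2SIZE)

	_Atomic uint32_t pseudo_prod;		/* free-running, never truncated here */

	/* Claim 'n' slots with one atomic add: no retry loop under contention. */
	uint32_t claim_slots_fetch_add(uint32_t n)
	{
		return atomic_fetch_add_explicit(&pseudo_prod, n,
						 memory_order_relaxed);
	}

	/* Convert a free-running index into an index-plus-wrap register value. */
	uint32_t to_prod_reg(uint32_t pseudo)
	{
		uint32_t idx  = pseudo & (Q_SIZE - 1);		/* truncate to queue size */
		uint32_t wrap = (pseudo >> Q_LOG2SIZE) & 1;	/* toggles on each wrap-around */

		return (wrap << Q_LOG2SIZE) | idx;
	}

The retry loop disappears entirely; what remains is guaranteeing that
the claimed slots are actually free, which is the space/locking point
raised below.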


That's what mine does. But I also need to take care of the cmdq locking and how we unconditionally provide space on the queue.

Cheers,
John