Re: [PATCH 0/4] iommu/arm-smmu-v3: Improve cmdq lock efficiency

From: Robin Murphy
Date: Thu Jul 16 2020 - 07:22:24 EST


On 2020-07-16 11:56, John Garry wrote:
On 16/07/2020 11:28, Will Deacon wrote:
On Thu, Jul 16, 2020 at 11:22:33AM +0100, Will Deacon wrote:
On Thu, Jul 16, 2020 at 11:19:41AM +0100, Will Deacon wrote:
On Tue, Jun 23, 2020 at 01:28:36AM +0800, John Garry wrote:
As mentioned in [0], the CPU may consume many cycles processing
arm_smmu_cmdq_issue_cmdlist(). One issue we found is that the cmpxchg() loop used to
get space on the queue takes approx 25% of the cycles for this function.

This series removes that cmpxchg().

How about something much simpler like the diff below?
Ah, scratch that, I don't drop the lock if we fail the cas with it held.
Let me hack it some more (I have no hardware so I can only build-test this).

Right, second attempt...

I can try it, but if performance is not as good, then please check mine further (patch 4/4 specifically) - performance is really good, IMHO.

Perhaps a silly question (I'm too engrossed in PMU world ATM to get properly back up to speed on this), but couldn't this be done without cmpxchg anyway? Instinctively it feels like instead of maintaining a literal software copy of the prod value, we could resolve the "claim my slot in the queue" part with atomic_fetch_add on a free-running 32-bit "pseudo-prod" index, then whoever updates the hardware deals with the truncation and wrap bit to convert it to an actual register value.
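
Something like the below is roughly what I have in mind - entirely untested, glossing
over the queue-full check, and both the helper and the "pseudo_prod" field are made-up
names that don't exist in the driver today, so treat it as a sketch of the idea rather
than a real proposal:

	static u32 arm_smmu_cmdq_claim_space(struct arm_smmu_cmdq *cmdq,
					     struct arm_smmu_ll_queue *llq,
					     u32 slots)
	{
		/*
		 * Free-running 32-bit index: claiming space is a single
		 * fetch-add (returning the old value), with no cmpxchg()
		 * retry loop.
		 */
		u32 pseudo = atomic_fetch_add(slots, &cmdq->pseudo_prod);

		/*
		 * Fold the free-running value back into index + wrap form:
		 * the low max_n_shift bits are the queue index, and the next
		 * bit up naturally toggles on every wrap.
		 */
		return pseudo & GENMASK(llq->max_n_shift, 0);
	}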

Robin.


Thanks,


Will

--->8

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index f578677a5c41..e6bcddd6ef69 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -560,6 +560,7 @@ struct arm_smmu_cmdq {
 	atomic_long_t			*valid_map;
 	atomic_t			owner_prod;
 	atomic_t			lock;
+	spinlock_t			slock;
 };
 
 struct arm_smmu_cmdq_batch {
@@ -1378,7 +1379,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	u64 cmd_sync[CMDQ_ENT_DWORDS];
 	u32 prod;
 	unsigned long flags;
-	bool owner;
+	bool owner, locked = false;
 	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
 	struct arm_smmu_ll_queue llq = {
 		.max_n_shift = cmdq->q.llq.max_n_shift,
@@ -1387,27 +1388,38 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 
 	/* 1. Allocate some space in the queue */
 	local_irq_save(flags);
-	llq.val = READ_ONCE(cmdq->q.llq.val);
 	do {
 		u64 old;
 
+		llq.val = READ_ONCE(cmdq->q.llq.val);
-		while (!queue_has_space(&llq, n + sync)) {
+		if (queue_has_space(&llq, n + sync))
+			goto try_cas;
+
+		if (locked)
+			spin_unlock(&cmdq->slock);
+
+		do {
 			local_irq_restore(flags);
 			if (arm_smmu_cmdq_poll_until_not_full(smmu, &llq))
 				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
 			local_irq_save(flags);
-		}
+		} while (!queue_has_space(&llq, n + sync));
 
+try_cas:
 		head.cons = llq.cons;
 		head.prod = queue_inc_prod_n(&llq, n + sync) |
 					  CMDQ_PROD_OWNED_FLAG;
 
 		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
-		if (old == llq.val)
+		if (old != llq.val)
 			break;
 
-		llq.val = old;
+		if (!locked) {
+			spin_lock(&cmdq->slock);
+			locked = true;
+		}
 	} while (1);
+
 	owner = !(llq.prod & CMDQ_PROD_OWNED_FLAG);
 	head.prod &= ~CMDQ_PROD_OWNED_FLAG;
 	llq.prod &= ~CMDQ_PROD_OWNED_FLAG;
@@ -3192,6 +3204,7 @@ static int arm_smmu_cmdq_init(struct arm_smmu_device *smmu)
 
 	atomic_set(&cmdq->owner_prod, 0);
 	atomic_set(&cmdq->lock, 0);
+	spin_lock_init(&cmdq->slock);
 
 	bitmap = (atomic_long_t *)bitmap_zalloc(nents, GFP_KERNEL);
 	if (!bitmap) {
.