Re: [PATCH v3 1/2] iommu/arm-smmu-v3: fix unexpected CMD_SYNC timeout

From: Leizhen (ThunderTown)
Date: Thu Aug 16 2018 - 00:20:09 EST




On 2018/8/16 2:08, John Garry wrote:
> On 15/08/2018 14:00, Will Deacon wrote:
>> On Wed, Aug 15, 2018 at 01:26:31PM +0100, Robin Murphy wrote:
>>> On 15/08/18 11:23, Zhen Lei wrote:
>>>> The condition "(int)(VAL - sync_idx) >= 0" to break loop in function
>>>> __arm_smmu_sync_poll_msi requires that sync_idx must be increased
>>>> monotonously according to the sequence of the CMDs in the cmdq.
>>>>
>>>> But ".msidata = atomic_inc_return_relaxed(&smmu->sync_nr)" is not protected
>>>> by spinlock, so the following scenarios may appear:
>>>> cpu0 cpu1
>>>> msidata=0
>>>> msidata=1
>>>> insert cmd1
>>>> insert cmd0
>>>> smmu execute cmd1
>>>> smmu execute cmd0
>>>> poll timeout, because msidata=1 is overridden by
>>>> cmd0, that means VAL=0, sync_idx=1.
>>>>
>>>> This is not a functional problem, just make the caller wait for a long
>>>> time until TIMEOUT. It's rare to happen, because any other CMD_SYNCs
>>>> during the waiting period will break it.
>>>>
>>>> Signed-off-by: Zhen Lei <thunder.leizhen@xxxxxxxxxx>
>>>> ---
>>>> drivers/iommu/arm-smmu-v3.c | 12 ++++++++----
>>>> 1 file changed, 8 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>>>> index 1d64710..3f5c236 100644
>>>> --- a/drivers/iommu/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm-smmu-v3.c
>>>> @@ -566,7 +566,7 @@ struct arm_smmu_device {
>>>>
>>>> int gerr_irq;
>>>> int combined_irq;
>>>> - atomic_t sync_nr;
>>>> + u32 sync_nr;
>>>>
>>>> unsigned long ias; /* IPA */
>>>> unsigned long oas; /* PA */
>>>> @@ -775,6 +775,11 @@ static int queue_remove_raw(struct arm_smmu_queue *q, u64 *ent)
>>>> return 0;
>>>> }
>>>>
>>>> +static inline void arm_smmu_cmdq_sync_set_msidata(u64 *cmd, u32 msidata)
>>>
>>> If we *are* going to go down this route then I think it would make sense to
>>> move the msiaddr and CMDQ_SYNC_0_CS_MSI logic here as well; i.e.
>>> arm_smmu_cmdq_build_cmd() always generates a "normal" SEV-based sync
>>> command, then calling this guy would convert it to an MSI-based one. As-is,
>>> having bits of mutually-dependent data handled across two separate places
>>> just seems too messy and error-prone.
>>
>> Yeah, but I'd first like to see some number showing that doing all of this
>> under the lock actually has an impact.
>
> Update:
>
> I tested this patch versus a modified version which builds the command under the queue spinlock (* below). From my testing there is a small difference:
>
> Setup:
> Testing Single NVME card
> fio 15 processes
> No process pinning
>
> Average Results:
> v3 patch read/r,w/write (IOPS): 301K/149K,149K/307K
> Build under lock version read/r,w/write (IOPS): 304K/150K,150K/311K
>
> I don't know why it's better to build under the lock. We can test more.

I have analysed the assembly code, the memset will be optimized as Robin said to be "stp xzr, xzr, [x0]",
and the switch..case is as below:
ffff0000085e5744 <arm_smmu_cmdq_build_cmd>:
ffff0000085e5744: a9007c1f stp xzr, xzr, [x0] //memset
ffff0000085e5748: 39400023 ldrb w3, [x1]
ffff0000085e574c: f9400002 ldr x2, [x0]
ffff0000085e5750: aa020062 orr x2, x3, x2
ffff0000085e5754: f9000002 str x2, [x0]
ffff0000085e5758: 39400023 ldrb w3, [x1] //ent->opcode
ffff0000085e575c: 51000463 sub w3, w3, #0x1
ffff0000085e5760: 7101147f cmp w3, #0x45
ffff0000085e5764: 54000069 b.ls ffff0000085e5770
ffff0000085e5768: 12800023 mov w3, #0xfffffffe
ffff0000085e576c: 1400000e b ffff0000085e57a4
ffff0000085e5770: b0003024 adrp x4, ffff000008bea000
ffff0000085e5774: 91096084 add x4, x4, #0x258 //static table in rodata
ffff0000085e5778: 38634883 ldrb w3, [x4,w3,uxtw] //use ent->opcode as index
ffff0000085e577c: 10000064 adr x4, ffff0000085e5788
ffff0000085e5780: 8b238883 add x3, x4, w3, sxtb #2
ffff0000085e5784: d61f0060 br x3 //jump to "case xxx:"

Actually, after apply the patch "inline arm_smmu_cmdq_build_cmd" sent by Robin, the memset and static table will be removed:
ffff0000085e68a8: 94123207 bl ffff000008a730c4 <_raw_spin_lock_irqsave>
ffff0000085e68ac: b9410ad5 ldr w21, [x22,#264]
ffff0000085e68b0: aa0003fa mov x26, x0
ffff0000085e68b4: 110006b5 add w21, w21, #0x1 //++smmu->sync_nr
ffff0000085e68b8: b9010ad5 str w21, [x22,#264]
ffff0000085e68bc: b50005f3 cbnz x19, ffff0000085e6978 //if (ent->sync.msiaddr)
ffff0000085e68c0: d28408c2 mov x2, #0x2046
ffff0000085e68c4: f2a1f802 movk x2, #0xfc0, lsl #16 //the constant part of CMD_SYNC
ffff0000085e68c8: aa158042 orr x2, x2, x21, lsl #32 //or msidata
ffff0000085e68cc: aa1603e0 mov x0, x22 //x0 = x22 = smmu
ffff0000085e68d0: 910163a1 add x1, x29, #0x58 //x1 = the address of local variable "cmd"
ffff0000085e68d4: f9002fa2 str x2, [x29,#88] //save cmd[0]
ffff0000085e68d8: 927ec673 and x19, x19, #0xffffffffffffc
ffff0000085e68dc: f90033b3 str x19, [x29,#96] //save cmd[1]
ffff0000085e68e0: 97fffd0d bl ffff0000085e5d14 <arm_smmu_cmdq_insert_cmd>

So that, my patch v2 plus Robin's "inline arm_smmu_cmdq_build_cmd()" is a good choice.

But the assembly code of my patch v3, it seems that is still shorter than above:
ffff0000085e695c: 9412320a bl ffff000008a73184 <_raw_spin_lock_irqsave>
ffff0000085e6960: aa0003f6 mov x22, x0
ffff0000085e6964: b9410a62 ldr w2, [x19,#264]
ffff0000085e6968: aa1303e0 mov x0, x19
ffff0000085e696c: f94023a3 ldr x3, [x29,#64]
ffff0000085e6970: 910103a1 add x1, x29, #0x40
ffff0000085e6974: 11000442 add w2, w2, #0x1 //++smmu->sync_nr
ffff0000085e6978: b9010a62 str w2, [x19,#264]
ffff0000085e697c: b9005ba2 str w2, [x29,#88]
ffff0000085e6980: aa028062 orr x2, x3, x2, lsl #32
ffff0000085e6984: f90023a2 str x2, [x29,#64]
ffff0000085e6988: 97fffd58 bl ffff0000085e5ee8 <arm_smmu_cmdq_insert_cmd>

>
> I suppose there is no justification to build the command outside the spinlock based on these results alone...
>
> Cheers,
> John
>
> * Modified version:
> static int __arm_smmu_cmdq_issue_sync_msi(struct arm_smmu_device *smmu)
> {
> u64 cmd[CMDQ_ENT_DWORDS];
> unsigned long flags;
> struct arm_smmu_cmdq_ent ent = {
> .opcode = CMDQ_OP_CMD_SYNC,
> .sync = {
> .msiaddr = virt_to_phys(&smmu->sync_count),
> },
> };
>
> spin_lock_irqsave(&smmu->cmdq.lock, flags);
> ent.sync.msidata = ++smmu->sync_nr;
> arm_smmu_cmdq_build_cmd(cmd, &ent);
> arm_smmu_cmdq_insert_cmd(smmu, cmd);
> spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
>
> return __arm_smmu_sync_poll_msi(smmu, ent.sync.msidata);
> }
>
>
>> Will
>>
>> .
>>
>
>
>
> .
>

--
Thanks!
BestRegards