Re: [PATCH] arm64/io: Don't use WZR in writel

From: Robin Murphy
Date: Tue Mar 19 2019 - 07:46:04 EST


On 18/03/2019 17:24, Robin Murphy wrote:
On 18/03/2019 17:19, Robin Murphy wrote:
On 18/03/2019 17:00, Russell King - ARM Linux admin wrote:
On Mon, Mar 18, 2019 at 04:04:03PM +0000, Robin Murphy wrote:
On 12/03/2019 12:36, Marc Gonzalez wrote:
On 24/02/2019 04:53, Bjorn Andersson wrote:

On Sat 23 Feb 10:37 PST 2019, Marc Zyngier wrote:

On Sat, 23 Feb 2019 18:12:54 +0000, Bjorn Andersson wrote:

On Mon 11 Feb 06:59 PST 2019, Marc Zyngier wrote:

On 11/02/2019 14:29, AngeloGioacchino Del Regno wrote:

Also, just one more thing: yes this thing is going ARM64-wide and
- from my findings - it's targeting certain Qualcomm SoCs, but...
I'm not sure that only QC is affected by that, others may as well
have the same stupid bug.

At the moment, only QC SoCs seem to be affected, probably because
everyone else has debugged their hypervisor (or most likely doesn't
bother with shipping one).

In all honesty, we need some information from QC here: which SoCs are
affected, what is the exact nature of the bug, can it be triggered from
EL0. Randomly papering over symptoms is not something I really like
doing, and is likely to generate problems on unaffected systems.

The bug at hand is that the XZR is not deemed a valid source in the
virtualization of the SMMU registers. It was identified and fixed for
all platforms that are shipping kernels based on v4.9 or later.

When you say "fixed": Do you mean fixed in the firmware? Or by adding
a workaround in the shipped kernel?

I mean that it's fixed in the firmware.

If the former, is this part of an official QC statement, with an
associated erratum number?

I don't know, will get back to you on this one.

Is this really limited to the SMMU accesses?

Yes.

As such Angelo's list of affected platforms covers the high-profile
ones. In particular MSM8996 and MSM8998 is getting pretty good support
upstream, if we can figure out a way around this issue.

We'd need an exhaustive list of the affected SoCs, and work out if we
can limit the hack to the SMMU driver (cc'ing Robin, who's the one
who'd know about it).

I will try to compose a list.

FWIW, I have just been bitten by this issue. I needed to enable an SMMU to
filter PCIe EP accesses to system RAM (or something). I'm using an APQ8098
MEDIABOX dev board. My system hangs in arm_smmu_device_reset() doing:

ÂÂÂÂ/* Invalidate the TLB, just in case */
ÂÂÂÂwritel_relaxed(0, gr0_base + ARM_SMMU_GR0_TLBIALLH);
ÂÂÂÂwritel_relaxed(0, gr0_base + ARM_SMMU_GR0_TLBIALLNSNH);


With the 'Z' constraint, gcc generates:

ÂÂÂÂstr wzr, [x0]

without the 'Z' constraint, gcc generates:

ÂÂÂÂmovÂÂÂ w1, 0
ÂÂÂÂstr w1, [x0]


I can work around the problem using the following patch:

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 045d93884164..93117519aed8 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -59,6 +59,11 @@
ÂÂ #include "arm-smmu-regs.h"
+static inline void qcom_writel(u32 val, volatile void __iomem *addr)
+{
+ÂÂÂ asm volatile("str %w0, [%1]" : : "r" (val), "r" (addr));
+}
+
ÂÂ #define ARM_MMU500_ACTLR_CPREÂÂÂÂÂÂÂ (1 << 1)
ÂÂ #define ARM_MMU500_ACR_CACHE_LOCKÂÂÂ (1 << 26)
@@ -422,7 +427,7 @@ static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu,
ÂÂ {
ÂÂÂÂÂÂ unsigned int spin_cnt, delay;
-ÂÂÂ writel_relaxed(0, sync);
+ÂÂÂ qcom_writel(0, sync);
ÂÂÂÂÂÂ for (delay = 1; delay < TLB_LOOP_TIMEOUT; delay *= 2) {
ÂÂÂÂÂÂÂÂÂÂ for (spin_cnt = TLB_SPIN_COUNT; spin_cnt > 0; spin_cnt--) {
ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ if (!(readl_relaxed(status) & sTLBGSTATUS_GSACTIVE))
@@ -1760,8 +1765,8 @@ static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
ÂÂÂÂÂÂ }
ÂÂÂÂÂÂ /* Invalidate the TLB, just in case */
-ÂÂÂ writel_relaxed(0, gr0_base + ARM_SMMU_GR0_TLBIALLH);
-ÂÂÂ writel_relaxed(0, gr0_base + ARM_SMMU_GR0_TLBIALLNSNH);
+ÂÂÂ qcom_writel(0, gr0_base + ARM_SMMU_GR0_TLBIALLH);
+ÂÂÂ qcom_writel(0, gr0_base + ARM_SMMU_GR0_TLBIALLNSNH);
ÂÂÂÂÂÂ reg = readl_relaxed(ARM_SMMU_GR0_NS(smmu) + ARM_SMMU_GR0_sCR0);



Can a quirk be used to work around the issue?
Or can we just "pessimize" the 3 writes for everybody?
(Might be cheaper than a test anyway)

If it really is just the SMMU driver which is affected, we can work around
it for free (not counting the 'cost' of slightly-weird-looking code, of
course). If the diff below works as expected, I'll write it up properly.

Robin.
----->8-----
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 045d93884164..7ff29e33298f 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -422,7 +422,7 @@ static void __arm_smmu_tlb_sync(struct arm_smmu_device
*smmu,
 {
ÂÂÂÂÂ unsigned int spin_cnt, delay;

-ÂÂÂ writel_relaxed(0, sync);
+ÂÂÂ writel_relaxed((unsigned long)sync, sync);
ÂÂÂÂÂ for (delay = 1; delay < TLB_LOOP_TIMEOUT; delay *= 2) {
ÂÂÂÂÂÂÂÂÂ for (spin_cnt = TLB_SPIN_COUNT; spin_cnt > 0; spin_cnt--) {
ÂÂÂÂÂÂÂÂÂÂÂÂÂ if (!(readl_relaxed(status) & sTLBGSTATUS_GSACTIVE))
@@ -681,7 +681,12 @@ static void arm_smmu_write_context_bank(struct
arm_smmu_device *smmu, int idx)

ÂÂÂÂÂ /* Unassigned context banks only need disabling */
ÂÂÂÂÂ if (!cfg) {
-ÂÂÂÂÂÂÂ writel_relaxed(0, cb_base + ARM_SMMU_CB_SCTLR);
+ÂÂÂÂÂÂÂ /*
+ÂÂÂÂÂÂÂÂ * For Qualcomm reasons, we want to guarantee that we write a
+ÂÂÂÂÂÂÂÂ * zero from a register which is not WZR. Fortunately, the cfg
+ÂÂÂÂÂÂÂÂ * logic here plays right into our hands...
+ÂÂÂÂÂÂÂÂ */
+ÂÂÂÂÂÂÂ writel_relaxed((unsigned long)cfg, cb_base + ARM_SMMU_CB_SCTLR);
ÂÂÂÂÂÂÂÂÂ return;
ÂÂÂÂÂ }

@@ -1760,8 +1765,8 @@ static void arm_smmu_device_reset(struct
arm_smmu_device *smmu)
ÂÂÂÂÂ }

ÂÂÂÂÂ /* Invalidate the TLB, just in case */
-ÂÂÂ writel_relaxed(0, gr0_base + ARM_SMMU_GR0_TLBIALLH);
-ÂÂÂ writel_relaxed(0, gr0_base + ARM_SMMU_GR0_TLBIALLNSNH);
+ÂÂÂ writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_TLBIALLH);
+ÂÂÂ writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_TLBIALLNSNH);

ÂÂÂÂÂ reg = readl_relaxed(ARM_SMMU_GR0_NS(smmu) + ARM_SMMU_GR0_sCR0);


Given what we've seen from Clang for futex stuff in 32-bit ARM, are
you really sure that the above will not result in Clang still spotting
that the value is zero and using a wzr for all these cases?

The trick is that in the write-only TLBI cases the variable we're passing in really is nonzero, so that can't possibly happen. For the context bank reset, yes, I am assuming that no complier will ever be perverse enough to detect that cfg is not written after the NULL check and immediately reallocate it to XZR for no good reason. I'd like to think that assumption is going to hold for the reasonable scope of this particular workaround, though.

Well, crap. So much for that hubris...


00000000000000f0 <arm_smmu_write_context_bank>:
ÂÂÂÂÂ f0:ÂÂÂÂÂÂ 52800504ÂÂÂÂÂÂÂ movÂÂÂÂ w4, #0x28 // #40
ÂÂÂÂÂ f4:ÂÂÂÂÂÂ f940240aÂÂÂÂÂÂÂ ldrÂÂÂÂ x10, [x0,#72]
ÂÂÂÂÂ f8:ÂÂÂÂÂÂ a9411402ÂÂÂÂÂÂÂ ldpÂÂÂÂ x2, x5, [x0,#16]
ÂÂÂÂÂ fc:ÂÂÂÂÂÂ 9b247c24ÂÂÂÂÂÂÂ smullÂÂ x4, w1, w4
ÂÂÂÂ 100:ÂÂÂÂÂÂ 8b040148ÂÂÂÂÂÂÂ addÂÂÂÂ x8, x10, x4
ÂÂÂÂ 104:ÂÂÂÂÂÂ 1ac52023ÂÂÂÂÂÂÂ lslÂÂÂÂ w3, w1, w5
ÂÂÂÂ 108:ÂÂÂÂÂÂ 8b23c042ÂÂÂÂÂÂÂ addÂÂÂÂ x2, x2, w3, sxtw
ÂÂÂÂ 10c:ÂÂÂÂÂÂ f9401107ÂÂÂÂÂÂÂ ldrÂÂÂÂ x7, [x8,#32]
ÂÂÂÂ 110:ÂÂÂÂÂÂ b5000067ÂÂÂÂÂÂÂ cbnzÂÂÂ x7, 11c <arm_smmu_write_context_bank+0x2c>
ÂÂÂÂ 114:ÂÂÂÂÂÂ b900005fÂÂÂÂÂÂÂ strÂÂÂÂ wzr, [x2]


Time to come up with a better SCTLR reset value, I guess.

Hmm, or perhaps not... The hangs are all reported in the TLB maintenance calls, and by the time we get to the first of those we've already written the context banks with their initial disabled value, which implies that that particular use of WZR isn't a problem (despite being the only one which actually consumes the value stored; how bizarre).

Robin.