Re: [PATCH RESEND v3 01/17] EDAC/synopsys: Fix native uMCTL2 IRQs handling procedure

From: Borislav Petkov
Date: Mon Oct 31 2022 - 13:25:04 EST


On Fri, Sep 30, 2022 at 02:26:56AM +0300, Serge Semin wrote:
> The generic DW uMCTL2 DDRC v3.x support was added in commit f7824ded4149
> ("EDAC/synopsys: Add support for version 3 of the Synopsys EDAC DDR"). It
> hasn't been done quiet well there with respect to the IRQs handling
> procedure. An attempt to fix that was introduced in the recent commit
> 4bcffe941758 ("EDAC/synopsys: Re-enable the error interrupts on v3 hw").
> Alas again it didn't provide quite complete solution.

Because?

Btw, for the future, you should make sure you add those commit authors
to Cc so that they can get a chance to comment. Adding the folks from
that commit to Cc.

> First of all the commit f7824ded4149 ("EDAC/synopsys: Add support for
> version 3 of the Synopsys EDAC DDR") log says that v3.80a "has UE/CE auto
> cleared". They aren't in none of the IP-core versions.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What does that sentence mean exactly? The UE/CE auto clearing
functionality is not in that silicon?

> The IRQ status can be cleared by means of setting the ECCCLR/ECCCTL
> register self-cleared flags 0-3.

I'm guessing that's this reg:

/* ECC control register */
#define ECC_CTRL_OFST 0xC4

?

> The pending IRQ clearance is done in the respective get_error_info()
> method of the driver. Thus defining a quirk flag with the
> "DDR_ECC_INTR_SELF_CLEAR" name was at least very inaccurate if not to
> say misleading.
>
> So was adding the comments about the "ce/ue bits automatically
> cleared".

Aah, you mean that the ->get_error_info() functions are doing the
clearing even if something should be doing self clearing. And that
DDR_ECC_INTR_SELF_CLEAR thing is queried when enabling the error
interrupt which is just bad naming because it looks like that quirk
controls what register to write/read.

> Second, disabling the being handled IRQ in the handler doesn't make sense
> in Linux since the IC line is masked during that procedure anyway. So
> disabling the IRQ in one part of the handler and enabling it at the end of
> the method is simply redundant. (See, the ZynqMP-specific code with the
> QoS IRQ CSR didn't do that originally.)

So what is this commit message of

4bcffe941758 ("EDAC/synopsys: Re-enable the error interrupts on v3 hw")

then talking about:

"Then the interrupt handler will be called only once."

How is that interrupt supposed to be reenabled?

> Finally calling the zynqmp_get_error_info() method concurrently with the
> enable_irq()/disable_irq() functions causes the IRQs mask state race
> condition. Starting from DW uMCTL2 DDRC IP-core v3.10a [1] the ECCCLR
> register has been renamed to ECCCTL and has been equipped with CE/UE IRQs
> enable/disable flags [2].

Aha, that sounds like the right thing to toggle.

> So the CSR now serves for the IRQ status and control functions used
> concurrently during the IRQ handling and the IRQ disabling/enabling.
> Thus the corresponding critical section must be protected with the
> IRQ-safe spin-lock.

> So let's fix all the problems noted above. First the
> DDR_ECC_INTR_SELF_CLEAR flag is renamed to SYNPS_ZYNQMP_IRQ_REGS.

"Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy to
do frotz", as if you are giving orders to the codebase to change its
behaviour."

In this case, pls formulate it something like this:

"So fix all these problems noted above: rename DDR_ECC_INTR_SELF_CLEAR
to SYNPS_ZYNQMP_IRQ_REGS to denote that, ..."

And so on.

> Its semantic is now the opposite: the quirk means having the ZynqMP
> IRQ CSRs available on the platform.

Yes, that makes more sense.

> Second the DDR_UE_MASK and DDR_CE_MASK macros
> are renamed to imply being used in the framework of the ECCCLR/ECCCTL CSRs
> accesses. Third all the misleading comments are removed. Finally the
> ECC_CLR_OFST register IOs are now protected with the IRQ-safe spin-lock
> taken in order to prevent the IRQ status clearance and IRQ enable/disable
> race condition.
>
> [1] DesignWare Cores Enhanced Universal DDR Memory and Protocol
> Controllers (uMCTL2/uPCTL2), Release Notes, Version 3.91a, October 2020,
> p. 27.
> [2] DesignWare® Cores Enhanced Universal DDR Memory Controller (uMCTL2),
> Databook Version 3.91a, October 2020, p.818-819.

If those are not publicly accessible, then there's no point to put them
in here.

> Fixes: f7824ded4149 ("EDAC/synopsys: Add support for version 3 of the Synopsys EDAC DDR")
> Signed-off-by: Serge Semin <Sergey.Semin@xxxxxxxxxxxxxxxxxxxx>

Does this need to go to stable@ and thus older kernels?

> ---
> drivers/edac/synopsys_edac.c | 76 +++++++++++++++++++++++-------------
> 1 file changed, 48 insertions(+), 28 deletions(-)

> @@ -300,6 +299,7 @@ struct synps_ecc_status {
> /**
> * struct synps_edac_priv - DDR memory controller private instance data.
> * @baseaddr: Base address of the DDR controller.
> + * @lock: Concurrent CSRs access lock.
> * @message: Buffer for framing the event specific info.
> * @stat: ECC status information.
> * @p_data: Platform data.
> @@ -314,6 +314,7 @@ struct synps_ecc_status {
> */
> struct synps_edac_priv {
> void __iomem *baseaddr;
> + spinlock_t lock;

That lock needs to be named properly and have a comment above it what it
protects.

> char message[SYNPS_EDAC_MSG_SIZE];
> struct synps_ecc_status stat;
> const struct synps_platform_data *p_data;

...

> @@ -516,24 +523,42 @@ static void handle_error(struct mem_ctl_info *mci, struct synps_ecc_status *p)
>
> static void enable_intr(struct synps_edac_priv *priv)
> {
> + unsigned long flags;
> +
> /* Enable UE/CE Interrupts */
> - if (priv->p_data->quirks & DDR_ECC_INTR_SELF_CLEAR)
> - writel(DDR_UE_MASK | DDR_CE_MASK,
> - priv->baseaddr + ECC_CLR_OFST);
> - else
> + if (priv->p_data->quirks & SYNPS_ZYNQMP_IRQ_REGS) {
> writel(DDR_QOSUE_MASK | DDR_QOSCE_MASK,
> priv->baseaddr + DDR_QOS_IRQ_EN_OFST);
>
> + return;
> + }
> +
> + /* IRQs Enable/Disable feature has been available since v3.10a */

How does this comment help here?

If it is available since a version number, why doesn't the below check a
version number? IOW, what is the relevance of that comment here?

In any case, I need to hear from this driver's maintainer too.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette