Re: Query regarding pseudo nmi support on GIC V3 and request_nmi()

From: Neeraj Upadhyay
Date: Fri May 08 2020 - 09:34:32 EST


Hi Marc,

On 5/8/2020 6:23 PM, Marc Zyngier wrote:
On Fri, 8 May 2020 18:09:00 +0530
Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx> wrote:

Hi Marc,

On 5/8/2020 5:57 PM, Marc Zyngier wrote:
On Fri, 8 May 2020 16:36:42 +0530
Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx> wrote:
Hi Marc,

On 5/8/2020 4:15 PM, Marc Zyngier wrote:
On Thu, 07 May 2020 17:06:19 +0100,
Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx> wrote:

Hi,

I have one query regarding pseudo NMI support on GIC v3; from what I
could understand, GIC v3 supports pseudo NMI setup for SPIs and PPIs.
However the request_nmi() in irq framework requires NMI to be per cpu
interrupt source (it checks for IRQF_PERCPU). Can you please help
understand this part, how SPIs can be configured as NMIs, if there is
a per cpu interrupt source restriction?

Let me answer your question by another question: what is the semantic
of a NMI if you can't associate it with a particular CPU?
>>
I was actually thinking of a use case, where, we have a watchdog
interrupt (which is a SPI), which is used for detecting software
hangs and cause device reset; If that interrupt's current cpu
affinity is on a core, where interrupts are disabled, we won't be
able to serve it; so, we need to group that interrupt as an fiq;

Linux doesn't use Group-0 interrupts, as they are strictly secure
(unless your SoC doesn't have EL3, which I doubt).

Yes, we handle that watchdog interrupt as a Group-0 interrupt, which
is handled as fiq in EL3.

I was thinking, if its feasible to mark that interrupt as pseudo
NMI and route it to EL1 as irq. However, looks like that is not the
semantic of a NMI and we would need something like pseudo NMI ipi
for this.

Sending a NMI IPI from another NMI handler? Even once I've added
these, there is no way this will work for that particular scenario.
Just look at the restrictions we impose on NMIs.

Sending a pseudo NMI IPI (to EL1) from fiq handler (which runs in
EL3); I will check, but do you think, that might not work?

How do you know, from EL3, what to write in memory so that the NMI
handler knows what you want to do? Are you going to parse the S1 page
tables? Hard-code the behaviour of some random Linux version in your
legendary non-updatable firmware? This isn't an acceptable behaviour.


Ok, I understand;

Initial thought was to use watchdog SPI as pseudo NMI; however, as pseudo NMIs are only per CPU sources, we were exploring the possibility of using an unused ipi (using the work which is done in [1] and [2] for SGIs) as pseudo NMI, which EL3 sends to EL1, on receiving watchdog fiq. The pseudo NMI handler would collect required debug information, to help indentify the lockup cause. We weren't thinking of communicating any information from EL3 fiq handler to EL1.

However, from this discussion, I realize that calling irq handler from
fiq handler, would not be possible. So, the approach looks flawed.

I believe, allowing a non-percpu pseudo NMI is not acceptable to community?



[1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/gic-sgi

[2] https://lkml.org/lkml/2020/4/25/135

An IPI is between two CPUs used by the same SW entitiy. What runs at
EL3 is completely alien to Linux, and so is Linux to EL3. If you want
to IPI, send Group-0 IPIs that are private to the firmware.


Ok got it; however, I wonder what's the use case of sending
SGI to EL1, from secure world, using ICC_ASGI1R. I thought it
allowed communication between EL1 and EL3; but, looks like I understood in wrong.

If you want to inject NMI-type exceptions into EL1, you can always try
SDEI (did I actually write this? Help!). But given your use case below,
that wouldn't work either.


Ok.

Frankly, if all you need to do is to reset the SoC, use EL3
firmware. That is what it is for.

Before triggering SoC reset, we want to collect certain EL1 debug
information like stack trace for CPUs and other debug information.

Frankly, if you are going to reset the SoC because EL1/EL2 has gone
bust, how can you trust it to do anything sensible when injecting an
interrupt?. Once you take a SPI at EL3, gather whatever state you want
from EL3. Do not involve EL1 at all.

M.


Agree that it might not work for all cases. But, for the cases like, some kernel code is stuck after disabling local irqs; pseudo NMI might still be able to run and capture stack and other debug info, to help detect the cause of lockups.


Thanks
Neeraj

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation