Re: [PATCH RFC 1/1] arm64: Use PSCI calls for CPU stop when hotplug is supported

From: Robin Murphy
Date: Fri Jan 25 2019 - 10:56:48 EST


On 25/01/2019 07:03, Pramod Kumar wrote:
On Wed, Jan 23, 2019 at 11:03 PM Mark Rutland <mark.rutland@xxxxxxx> wrote:

On Wed, Jan 23, 2019 at 09:05:26AM -0800, Scott Branden wrote:
Hi Mark,

Hopefully I can shed some light on the use case inline.

On 2019-01-23 8:48 a.m., Mark Rutland wrote:
On Mon, Jan 21, 2019 at 11:30:02AM +0530, Pramod Kumar wrote:
On Mon, Jan 21, 2019 at 11:28 AM Pramod Kumar <pramod.kumar@xxxxxxxxxxxx>
wrote:

The need comes from a specific use case where an accelerator card (SoC) is
plugged into a server over a PCIe interface. The card is backed by a
battery, which can supply only a small amount of power for a very short
time in case of a power loss. Once the card switches to battery, it has to
reduce its power consumption to the lowest possible level and back up the
DDR contents ASAP, before the battery is fully drained.
In this example is Linux running on the server, or on the accelerator?
Accelerator

What precisely are you trying to back up from DDR, and why?
Data in DDR is being written to disk at this time (disk is connected to
accelerator)

What is responsible for backing up that contents?

A low power M-class processor and DMA engine, which continue the operations
necessary to transfer DDR memory to disk.

The high power processors on the accelerator running Linux need to be
halted ASAP on this power loss event so that the M0 can take over. Graceful
shutdown of Linux and other peripherals is unnecessary (and we don't have
the power necessary to do so).

If graceful shutdown of Linux is not required (and is in fact
undesirable), why is Linux involved at all in this shutdown process?

For example, why is this not a secure interrupt taken to EL3, which can
(gracefully) shut down the CPUs regardless?


This is a GPIO interrupt. It cannot be marked secure, because that would
require marking the whole GPIO controller as secure, which is not possible:
the GPIO controller is meant for the non-secure world and has more than 100
lines connected.

I agree we have a workaround where we invoke a handler in Linux and
switch to ATF via SMC. From ATF we need to bring all secondary CPUs into
ATF by sending an SGI; each core then flushes its L1/L2 and takes itself
out of the coherency domain and cluster, and the MCU shuts down the CPU
subsystem gracefully. This could work for our requirement. We need to
check ATF support for that.

Right, SMCCC has whole spaces for SoC-specific and platform-specific service calls. If your system has a need to power off as fast as possible under system-specific constraints, it seems much more sensible to immediately tell the firmware "power off as fast as possible under the system-specific constraints that you have full knowledge of, please", rather than trying to coax the generic kernel_halt() (or whatever) infrastructure to sort-of-do-what-you-want.
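
To illustrate the kind of call meant here, below is a minimal sketch of how a SiP-owned function ID could be composed following the SMCCC function ID layout (bit 31 = fast call, bit 30 = SMC64, bits 29:24 = owning entity, bits 15:0 = function number). The function number and the names SIP_FUNC_EMERGENCY_POWEROFF and sip_emergency_poweroff_id() are hypothetical, made up purely for illustration; a real platform would define its own in its firmware interface.

```c
#include <stdint.h>

/* SMCCC function ID fields, per the SMC Calling Convention. */
#define SMCCC_FAST_CALL  1u  /* bit 31: fast (atomic) call */
#define SMCCC_SMC_64     1u  /* bit 30: SMC64 calling convention */
#define SMCCC_OWNER_SIP  2u  /* bits 29:24: SiP service calls */

/*
 * HYPOTHETICAL function number for "power off as fast as possible".
 * This is an assumption for illustration only, not a real allocation.
 */
#define SIP_FUNC_EMERGENCY_POWEROFF 0x0001u

/* Compose a 32-bit SMCCC function ID from its fields. */
static inline uint32_t smccc_call_val(uint32_t type, uint32_t cc,
                                      uint32_t owner, uint32_t func)
{
    return ((type & 1u) << 31) |
           ((cc & 1u) << 30) |
           ((owner & 0x3Fu) << 24) |
           (func & 0xFFFFu);
}

/* Function ID for the hypothetical SiP emergency power-off service. */
static inline uint32_t sip_emergency_poweroff_id(void)
{
    return smccc_call_val(SMCCC_FAST_CALL, SMCCC_SMC_64,
                          SMCCC_OWNER_SIP, SIP_FUNC_EMERGENCY_POWEROFF);
}
```

In the kernel, the handler for the power-loss interrupt would presumably pass such an ID as the first argument of arm_smccc_smc() and leave everything else (bringing the other CPUs out of coherency, powering down) to the firmware.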

But what about a generic system? This patch addresses the requirement of a
generic multi-master system. Consider a system where shutting down Linux
does not mean shutting down the complete system. Take the example of a
SmartNIC, where the NIC master and the CPUs both access cacheable DDR. In a
SmartNIC it is quite common to bring up CPUs on demand, i.e. only when
needed, with the help of the MCU.
Now consider the full-fledged system. If the CPU subsystem is shut down via
a poweroff command that does not take the secondary CPUs out of the
coherency domain, the whole system becomes unstable: when the NIC master
accesses DDR, a snoop is also sent to the CPUs, which are no longer
available. The fabric/system hangs...

Not sure that's really relevant here... If platform firmware is able to power things off in a way that breaks the platform, surely that's entirely the firmware's own fault.

I feel that while shutting down the CPU subsystem or powering off, all
secondary CPUs must be shut down properly by taking them out of the
coherency domain, so that the rest of the system remains usable. I agree
that introducing the PSCI call adds delay to the shutdown/reboot case, but
stability matters more than a small delay.

Again, if you don't trust the firmware to implement SYSTEM_OFF appropriately for the platform, can you really assume its CPU_OFF implementation is safe either?

People already complain today about how long CPU bringup takes on certain systems. Extending their reboot cycle by a similar degree for reasons that are entirely irrelevant to those systems is hardly going to make those users any happier.

Robin.