Re: [PATCH 0/2] measure latency of cpu hotplug path
From: Peter Zijlstra
Date: Mon Sep 28 2020 - 03:40:36 EST
On Sun, Sep 27, 2020 at 07:41:45PM -0700, psodagud@xxxxxxxxxxxxxx wrote:
> On 2020-09-24 07:58, Steven Rostedt wrote:
> > On Thu, 24 Sep 2020 10:34:14 +0200
> > peterz@xxxxxxxxxxxxx wrote:
> >
> > > On Wed, Sep 23, 2020 at 04:37:44PM -0700, Prasad Sodagudi wrote:
> > > > These are all changes related to the cpu hotplug path, and we would
> > > > like to seek upstream review. These patches have been in the Qualcomm
> > > > downstream kernel for quite a long time. The first patch sets RT
> > > > priority for the hotplug task and the second patch adds cpuhp trace
> > > > events.
> > > >
> > > > 1) cpu-hotplug: Always use real time scheduling when hotplugging a CPU
> > > > 2) cpu/hotplug: Add cpuhp_latency trace event
> > >
> > > Why? Hotplug is a known super slow path. If you care about hotplug
> > > latency, you're doing it wrong.
> Hi Peter,
>
> [PATCH 2/2] cpu/hotplug: Add cpuhp_latency trace event -
> 1) Tracing of cpuhp operations is important to find whether upstream
> changes, out-of-tree modules, or firmware changes caused a latency
> regression. One way to sample that latency from userspace is sketched
> below.
This is a contradiction in terms; it is impossible to have a latency
regression if you don't care about latency in this super slow path to
begin with.
> 2) Secondary cpus are hotplugged out during device suspend and hotplugged
> in during resume.
Indeed they are.
> 3) The impact of firmware changes (PSCI call handling in firmware) needs
> to be tested, right?
Firmware is firmware; it's broken by design and we can't fix it if it's
broken. The only sane solution is not having firmware :-)
> 4) Dynamic callbacks registered with the cpu hotplug framework
> (CPUHP_AP_ONLINE_DYN) may also impact hotplug latency, as in the sketch
> below.
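>
> For reference, a module hooks into that range roughly like this (a
> sketch; the state name and callback bodies are placeholders):
>
> #include <linux/cpuhotplug.h>
> #include <linux/module.h>
>
> static int my_online(unsigned int cpu)
> {
> 	/* Per-CPU setup; runs on every CPU coming online. */
> 	return 0;
> }
>
> static int my_offline(unsigned int cpu)
> {
> 	/* Per-CPU teardown; this adds to hotplug-out latency. */
> 	return 0;
> }
>
> static int __init my_init(void)
> {
> 	/* The core allocates a free slot in the ONLINE_DYN range. */
> 	int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x/my:online",
> 				    my_online, my_offline);
>
> 	return ret < 0 ? ret : 0;
> }
> module_init(my_init);
> MODULE_LICENSE("GPL");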
Again, nobody cares.
> [PATCH 1/2] cpu-hotplug: Always use real time scheduling when hotplugging
> a CPU -
>
> While stress testing CPU hotplug operations under full system load, the
> following problem is observed.
>
> CPU hotplug operations take place in preemptible context. This leaves the
> hotplugging thread at the mercy of overall system load and CPU
> availability. If the hotplugging thread does not get an opportunity to
> execute after it has begun a hotplug operation, CPUs can end up stuck in
> a quasi-online state. In the worst case, a CPU can be stuck in a state
> where the migration thread is parked while another task is executing and
> changing its affinity in a loop. This combination can result in unbounded
> execution time for the running task until the hotplugging thread gets a
> chance to run and complete the hotplug operation.
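>
> The affinity-changing part of that scenario can be reproduced with a
> userspace loop along these lines (a sketch; the two CPUs are arbitrary):
>
> #define _GNU_SOURCE
> #include <sched.h>
>
> /* Busy task that keeps rewriting its own affinity between two CPUs. */
> int main(void)
> {
> 	cpu_set_t set;
> 	int cpu = 0;
>
> 	for (;;) {
> 		CPU_ZERO(&set);
> 		CPU_SET(cpu, &set);
> 		sched_setaffinity(0, sizeof(set), &set);
> 		cpu ^= 1;	/* bounce between cpu0 and cpu1 */
> 	}
> }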
How is that not an administration problem?
Also, you shouldn't be able to change your affinity _to_ a CPU that's
going down. One of the very first steps in hotplug ensures that.
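Concretely, sched_cpu_deactivate() clears the CPU from cpu_active_mask
right at the start of the offline sequence, and the affinity path only
accepts active CPUs. Paraphrased (not the literal kernel code, which
varies by version):

	/* Sketch of the validation in __set_cpus_allowed_ptr(): */
	const struct cpumask *cpu_valid_mask = cpu_active_mask;

	/* A mask containing only inactive CPUs is rejected. */
	if (!cpumask_intersects(new_mask, cpu_valid_mask))
		return -EINVAL;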