Re: [PATCH v5] cpu/hotplug: Do not bail-out in DYING/STARTING sections
From: Thorsten Leemhuis
Date: Wed Aug 17 2022 - 05:46:39 EST
[CCing boris]
Hi, this is your Linux kernel regression tracker.
On 25.07.22 11:59, Vincent Donnefort wrote:
> The DYING/STARTING callbacks are not expected to fail. However, as reported
> by Derek, drivers such as tboot are still free to return errors within
> those sections, which halts the hot(un)plug and leaves the CPU in an
> unrecoverable state.
>
> No rollback being possible there, let's only log the failures and proceed
> with the following steps. This restores the hotplug behaviour prior to
> commit 453e41085183 ("cpu/hotplug: Add cpuhp_invoke_callback_range()")
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215867
> Fixes: 453e41085183 ("cpu/hotplug: Add cpuhp_invoke_callback_range()")
> Reported-by: Derek Dolney <z23@xxxxxxxxxx>
> Signed-off-by: Vincent Donnefort <vdonnefort@xxxxxxxxxx>
> Tested-by: Derek Dolney <z23@xxxxxxxxxx>
What's the status here? Did that patch to fixing a regression fall
through the cracks? It looks like nothing happened for 3 weeks now,
that's why I wondered, but maybe I missed something.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.
> v4 -> v5:
> - Remove WARN, only log broken states with pr_warn.
> v3 -> v4:
> - Sorry ... wrong commit description style ...
> v2 -> v3:
> - Tested-by tag.
> - Refine commit description.
> - Bugzilla link.
> v1 -> v2:
> - Commit message rewording.
> - More details in the warnings.
> - Some variable renaming
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index bbad5e375d3b..621e5af42d57 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -663,21 +663,51 @@ static bool cpuhp_next_state(bool bringup,
> return true;
> }
>
> -static int cpuhp_invoke_callback_range(bool bringup,
> - unsigned int cpu,
> - struct cpuhp_cpu_state *st,
> - enum cpuhp_state target)
> +static int __cpuhp_invoke_callback_range(bool bringup,
> + unsigned int cpu,
> + struct cpuhp_cpu_state *st,
> + enum cpuhp_state target,
> + bool nofail)
> {
> enum cpuhp_state state;
> - int err = 0;
> + int ret = 0;
>
> while (cpuhp_next_state(bringup, &state, st, target)) {
> + int err;
> +
> err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
> - if (err)
> + if (!err)
> + continue;
> +
> + if (nofail) {
> + pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
> + cpu, bringup ? "UP" : "DOWN",
> + cpuhp_get_step(st->state)->name,
> + st->state, err);
> + ret = -1;
> + } else {
> + ret = err;
> break;
> + }
> }
>
> - return err;
> + return ret;
> +}
> +
> +static inline int cpuhp_invoke_callback_range(bool bringup,
> + unsigned int cpu,
> + struct cpuhp_cpu_state *st,
> + enum cpuhp_state target)
> +{
> + return __cpuhp_invoke_callback_range(bringup, cpu, st, target, false);
> +}
> +
> +static inline void cpuhp_invoke_callback_range_nofail(bool bringup,
> + unsigned int cpu,
> + struct cpuhp_cpu_state *st,
> + enum cpuhp_state target)
> +{
> + __cpuhp_invoke_callback_range(bringup, cpu, st, target, true);
> }
>
> static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st)
> @@ -999,7 +1029,6 @@ static int take_cpu_down(void *_param)
> struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
> int err, cpu = smp_processor_id();
> - int ret;
>
> /* Ensure this CPU doesn't handle any more interrupts. */
> err = __cpu_disable();
> @@ -1012,13 +1041,11 @@ static int take_cpu_down(void *_param)
> */
> WARN_ON(st->state != (CPUHP_TEARDOWN_CPU - 1));
>
> - /* Invoke the former CPU_DYING callbacks */
> - ret = cpuhp_invoke_callback_range(false, cpu, st, target);
> -
> /*
> + * Invoke the former CPU_DYING callbacks
> * DYING must not fail!
> */
> - WARN_ON_ONCE(ret);
> + cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
>
> /* Give up timekeeping duties */
> tick_handover_do_timer();
> @@ -1296,16 +1323,14 @@ void notify_cpu_starting(unsigned int cpu)
> {
> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
> enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);
> - int ret;
>
> rcu_cpu_starting(cpu); /* Enables RCU usage on this CPU. */
> cpumask_set_cpu(cpu, &cpus_booted_once_mask);
> - ret = cpuhp_invoke_callback_range(true, cpu, st, target);
>
> /*
> * STARTING must not fail!
> */
> - WARN_ON_ONCE(ret);
> + cpuhp_invoke_callback_range_nofail(true, cpu, st, target);
> }
>
> /*