Re: [PATCH v5] cpu/hotplug: Do not bail-out in DYING/STARTING sections
From: Thorsten Leemhuis
Date: Tue Sep 20 2022 - 05:59:16 EST
On 17.08.22 11:46, Thorsten Leemhuis wrote:
>
> Hi, this is your Linux kernel regression tracker.
>
> On 25.07.22 11:59, Vincent Donnefort wrote:
>> The DYING/STARTING callbacks are not expected to fail. However, as reported
>> by Derek, drivers such as tboot are still free to return errors within
>> those sections, which halts the hot(un)plug and leaves the CPU in an
>> unrecoverable state.
>>
>> No rollback being possible there, let's only log the failures and proceed
>> with the following steps. This restores the hotplug behaviour prior to
>> commit 453e41085183 ("cpu/hotplug: Add cpuhp_invoke_callback_range()")
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215867
>> Fixes: 453e41085183 ("cpu/hotplug: Add cpuhp_invoke_callback_range()")
>> Reported-by: Derek Dolney <z23@xxxxxxxxxx>
>> Signed-off-by: Vincent Donnefort <vdonnefort@xxxxxxxxxx>
>> Tested-by: Derek Dolney <z23@xxxxxxxxxx>
>
> What's the status here? Did that patch to fixing a regression fall
> through the cracks? It looks like nothing happened for 3 weeks now,
> that's why I wondered, but maybe I missed something.
Hmm, Vincent seems to be MIA, at least I see no recent messages from him
on lore. Odd. But well, it's still a fix for a regression and it's up to
v5 already; Valentin already added his Reviewed-by, too. Would be a
shame to waste this.
Thomas, could you maybe take a look at the patch? Maybe we're lucky and
the patch is already good to go...
Ciao, Thorsten
#regzbot poke
> P.S.: As the Linux kernel's regression tracker I deal with a lot of
> reports and sometimes miss something important when writing mails like
> this. If that's the case here, don't hesitate to tell me in a public
> reply, it's in everyone's interest to set the public record straight.
>
>
>> v4 -> v5:
>> - Remove WARN, only log broken states with pr_warn.
>> v3 -> v4:
>> - Sorry ... wrong commit description style ...
>> v2 -> v3:
>> - Tested-by tag.
>> - Refine commit description.
>> - Bugzilla link.
>> v1 -> v2:
>> - Commit message rewording.
>> - More details in the warnings.
>> - Some variable renaming
>>
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index bbad5e375d3b..621e5af42d57 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -663,21 +663,51 @@ static bool cpuhp_next_state(bool bringup,
>> return true;
>> }
>>
>> -static int cpuhp_invoke_callback_range(bool bringup,
>> - unsigned int cpu,
>> - struct cpuhp_cpu_state *st,
>> - enum cpuhp_state target)
>> +static int __cpuhp_invoke_callback_range(bool bringup,
>> + unsigned int cpu,
>> + struct cpuhp_cpu_state *st,
>> + enum cpuhp_state target,
>> + bool nofail)
>> {
>> enum cpuhp_state state;
>> - int err = 0;
>> + int ret = 0;
>>
>> while (cpuhp_next_state(bringup, &state, st, target)) {
>> + int err;
>> +
>> err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
>> - if (err)
>> + if (!err)
>> + continue;
>> +
>> + if (nofail) {
>> + pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
>> + cpu, bringup ? "UP" : "DOWN",
>> + cpuhp_get_step(st->state)->name,
>> + st->state, err);
>> + ret = -1;
>> + } else {
>> + ret = err;
>> break;
>> + }
>> }
>>
>> - return err;
>> + return ret;
>> +}
>> +
>> +static inline int cpuhp_invoke_callback_range(bool bringup,
>> + unsigned int cpu,
>> + struct cpuhp_cpu_state *st,
>> + enum cpuhp_state target)
>> +{
>> + return __cpuhp_invoke_callback_range(bringup, cpu, st, target, false);
>> +}
>> +
>> +static inline void cpuhp_invoke_callback_range_nofail(bool bringup,
>> + unsigned int cpu,
>> + struct cpuhp_cpu_state *st,
>> + enum cpuhp_state target)
>> +{
>> + __cpuhp_invoke_callback_range(bringup, cpu, st, target, true);
>> }
>>
>> static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st)
>> @@ -999,7 +1029,6 @@ static int take_cpu_down(void *_param)
>> struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
>> enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
>> int err, cpu = smp_processor_id();
>> - int ret;
>>
>> /* Ensure this CPU doesn't handle any more interrupts. */
>> err = __cpu_disable();
>> @@ -1012,13 +1041,11 @@ static int take_cpu_down(void *_param)
>> */
>> WARN_ON(st->state != (CPUHP_TEARDOWN_CPU - 1));
>>
>> - /* Invoke the former CPU_DYING callbacks */
>> - ret = cpuhp_invoke_callback_range(false, cpu, st, target);
>> -
>> /*
>> + * Invoke the former CPU_DYING callbacks
>> * DYING must not fail!
>> */
>> - WARN_ON_ONCE(ret);
>> + cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
>>
>> /* Give up timekeeping duties */
>> tick_handover_do_timer();
>> @@ -1296,16 +1323,14 @@ void notify_cpu_starting(unsigned int cpu)
>> {
>> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
>> enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);
>> - int ret;
>>
>> rcu_cpu_starting(cpu); /* Enables RCU usage on this CPU. */
>> cpumask_set_cpu(cpu, &cpus_booted_once_mask);
>> - ret = cpuhp_invoke_callback_range(true, cpu, st, target);
>>
>> /*
>> * STARTING must not fail!
>> */
>> - WARN_ON_ONCE(ret);
>> + cpuhp_invoke_callback_range_nofail(true, cpu, st, target);
>> }
>>
>> /*