Re: [PATCH] platform/x86/intel-uncore-freq: fix inconsistent state on init failure

From: liu shouye
Date: Tue Apr 15 2025 - 02:56:38 EST


srinivas pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx> 于2025年4月15日周二 00:08写道:
>
> On Mon, 2025-04-14 at 13:41 +0300, Ilpo Järvinen wrote:
> > On Mon, 14 Apr 2025, shouyeliu wrote:
> >
> > > When uncore_event_cpu_online() fails to initialize a control CPU
> > > (e.g.,
> > > due to memory allocation failure or uncore_freq_add_entry()
> > > errors),
> > > the code leaves stale entries in uncore_cpu_mask after that online
> > > CPU
> > > will not try to call uncore_freq_add_entry, resulting in no sys
> > > interface.
> >
> > Please add () after any name that refers to a C function (you're not
> > even
> > being consistent here as you had it in some cases but not here).
ok,I will modify it in the next version
> >
> > Please try to split the very long sentence a bit and make it more
> > obvious
> > what causes what as the current wording is a bit vague, did you mean:
> > uncore_event_cpu_online() will not call uncore_freq_add_entry() for
> > another CPU that is being onlined or something along those lines?
> >
> > Will this change work/matter? Documentation/core-api/cpu_hotplug.rst
> > says
> > about cpuhp_setup_state():
> >
> > "If a callback fails for CPU N then the teardown callback for CPU
> > 0 .. N-1 is invoked to rollback the operation. The state setup
> > fails,
> > the callbacks for the state are not installed and in case of dynamic
> > allocation the allocated state is freed."
> >
>
> Yes, cpuhp_setup_state() will fail and which will result in clean up.
> So any fail of any fail uncore_event_cpu_online() will result in no sys
> entries.
>
> I think here the intention is to keep sys entries, which will not
> happen with this patch.
>
> For confirmation on 6.14 kernel, I forced failure on CPU 10:
>
> [595799.696873] intel_uncore_init
> [595799.700102] uncore_event_cpu_online cpu:0
> [595799.704240] uncore_event_cpu_online cpu:1
> [595799.708360] uncore_event_cpu_online cpu:2
> [595799.712505] uncore_event_cpu_online cpu:3
> [595799.716633] uncore_event_cpu_online cpu:4
> [595799.720755] uncore_event_cpu_online cpu:5
> [595799.724953] uncore_event_cpu_online cpu:6
> [595799.729158] uncore_event_cpu_online cpu:7
> [595799.733409] uncore_event_cpu_online cpu:8
> [595799.737674] uncore_event_cpu_online cpu:9
> [595799.741954] uncore_event_cpu_online cpu:10
> [595799.746134] Force CPU 10 to fail online
> [595799.750182] uncore_event_cpu_offline cpu:0
> [595799.754508] uncore_event_cpu_offline cpu:1
> [595799.758834] uncore_event_cpu_offline cpu:2
> [595799.763238] uncore_event_cpu_offline cpu:3
> [595799.767558] uncore_event_cpu_offline cpu:4
> [595799.771832] uncore_event_cpu_offline cpu:5
> [595799.776178] uncore_event_cpu_offline cpu:6
> [595799.780506] uncore_event_cpu_offline cpu:7
> [595799.784862] uncore_event_cpu_offline cpu:8
> [595799.789247] uncore_event_cpu_offline cpu:9
> [595799.793540] intel_uncore_init cpuhp_setup_state failed
> [595799.798776] intel_uncore_init failed
>
>
> Thanks,
> Srinivas
Registering the CPU hot-plug callback function during booting can be
handled correctly. I think the problem occurs during runtime.
The original code may have problems when the CPU hot-plug modifies the
management CPU during runtime:
Assume that the CPUs of package 1 are 8-15, and the uncore driver has
been registered at boot time;
1. Offline all CPU No.8-15
2. Try online CPU No. 8, the code executes cpumask_set_cpu()
successfully, but fails in the uncore_freq_add_entry() process. At
this time, the mark of CPU No. 8 is added to uncore_cpu_mask, but no
sys interface is generated,cpu No.8 online fails;
3. Try online CPU No. 8 again, cpumask_any_and() judges success, and
the CPU No.8 online is successful at this time;
4. Assume that the attempt to online CPU No. 9-15 is successful at
this time, but there is no sys interface ————unexpected behavior 1.
5. Offline CPU No. 9-15, and offline No.8, will eventually call
uncore_freq_remove_die_entry()————unexpected behavior 2 is generated,
which may cause a crash.
>
>
>
> > >
> >
> > Fixes tag?
> >
> > > Signed-off-by: shouyeliu <shouyeliu@xxxxxxxxx>
> >
> > The correct format for tags is documented in
> > Documentation/process/5.Posting.rst:
> >
> > tag: Full Name <email address>
ok,I will modify it in the next version
> >
> > > ---
> > > .../x86/intel/uncore-frequency/uncore-frequency.c | 12
> > > ++++++++----
> > > 1 file changed, 8 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/platform/x86/intel/uncore-frequency/uncore-
> > > frequency.c b/drivers/platform/x86/intel/uncore-frequency/uncore-
> > > frequency.c
> > > index 40bbf8e45fa4..1de0a4a9d6cd 100644
> > > --- a/drivers/platform/x86/intel/uncore-frequency/uncore-
> > > frequency.c
> > > +++ b/drivers/platform/x86/intel/uncore-frequency/uncore-
> > > frequency.c
> > > @@ -146,15 +146,13 @@ static int uncore_event_cpu_online(unsigned
> > > int cpu)
> > > {
> > > struct uncore_data *data;
> > > int target;
> > > + int ret;
> > >
> > > /* Check if there is an online cpu in the package for
> > > uncore MSR */
> > > target = cpumask_any_and(&uncore_cpu_mask,
> > > topology_die_cpumask(cpu));
> > > if (target < nr_cpu_ids)
> > > return 0;
> > >
> > > - /* Use this CPU on this die as a control CPU */
> > > - cpumask_set_cpu(cpu, &uncore_cpu_mask);
> > > -
> > > data = uncore_get_instance(cpu);
> > > if (!data)
> > > return 0;
> > > @@ -163,7 +161,13 @@ static int uncore_event_cpu_online(unsigned
> > > int cpu)
> > > data->die_id = topology_die_id(cpu);
> > > data->domain_id = UNCORE_DOMAIN_ID_INVALID;
> > >
> > > - return uncore_freq_add_entry(data, cpu);
> > > + ret = uncore_freq_add_entry(data, cpu);
> > > + if (!ret) {
> > > + /* Use this CPU on this die as a control CPU */
> > > + cpumask_set_cpu(cpu, &uncore_cpu_mask);
> > > + }
> > > +
> > > + return ret;
> >
> > Please reverse to logic such that you return early on error, which is
> > the
> > usual error handling pattern.
ok,I will modify it in the next version
> >
> > > }
> > >
> > > static int uncore_event_cpu_offline(unsigned int cpu)
> > >
> >
>