Re: [PATCH] platform/x86/intel-uncore-freq: fix inconsistent state on init failure
From: Ilpo Järvinen
Date: Tue Apr 15 2025 - 11:11:16 EST
On Tue, 15 Apr 2025, srinivas pandruvada wrote:
> On Tue, 2025-04-15 at 14:39 +0800, liu shouye wrote:
>
>
> srinivas pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx> 于2025年4月15日周二
> 00:08写道:
> On Mon, 2025-04-14 at 13:41 +0300, Ilpo Järvinen wrote:
> > On Mon, 14 Apr 2025, shouyeliu wrote:
> >
> > > When uncore_event_cpu_online() fails to initialize a control
> CPU
> > > (e.g.,
> > > due to memory allocation failure or uncore_freq_add_entry()
> > > errors),
> > > the code leaves stale entries in uncore_cpu_mask after that
> online
> > > CPU
> > > will not try to call uncore_freq_add_entry, resulting in no sys
> > > interface.
> >
> > Please add () after any name that refers to a C function (you're
> not
> > even
> > being consistent here as you had it in some cases but not here).
>
> ok,I will modify it in the next version
>
> >
> > Please try to split the very long sentence a bit and make it more
> > obvious
> > what causes what as the current wording is a bit vague, did you
> mean:
> > uncore_event_cpu_online() will not call uncore_freq_add_entry()
> for
> > another CPU that is being onlined or something along those lines?
> >
> > Will this change work/matter?
> Documentation/core-api/cpu_hotplug.rst
> > says
> > about cpuhp_setup_state():
> >
> > "If a callback fails for CPU N then the teardown callback for CPU
> > 0 .. N-1 is invoked to rollback the operation. The state setup
> > fails,
> > the callbacks for the state are not installed and in case of
> dynamic
> > allocation the allocated state is freed."
> >
>
> Yes, cpuhp_setup_state() will fail and which will result in clean up.
> So any fail of any fail uncore_event_cpu_online() will result in no sys
> entries.
>
> I think here the intention is to keep sys entries, which will not
> happen with this patch.
>
> For confirmation on 6.14 kernel, I forced failure on CPU 10:
>
> [595799.696873] intel_uncore_init
> [595799.700102] uncore_event_cpu_online cpu:0
> [595799.704240] uncore_event_cpu_online cpu:1
> [595799.708360] uncore_event_cpu_online cpu:2
> [595799.712505] uncore_event_cpu_online cpu:3
> [595799.716633] uncore_event_cpu_online cpu:4
> [595799.720755] uncore_event_cpu_online cpu:5
> [595799.724953] uncore_event_cpu_online cpu:6
> [595799.729158] uncore_event_cpu_online cpu:7
> [595799.733409] uncore_event_cpu_online cpu:8
> [595799.737674] uncore_event_cpu_online cpu:9
> [595799.741954] uncore_event_cpu_online cpu:10
> [595799.746134] Force CPU 10 to fail online
> [595799.750182] uncore_event_cpu_offline cpu:0
> [595799.754508] uncore_event_cpu_offline cpu:1
> [595799.758834] uncore_event_cpu_offline cpu:2
> [595799.763238] uncore_event_cpu_offline cpu:3
> [595799.767558] uncore_event_cpu_offline cpu:4
> [595799.771832] uncore_event_cpu_offline cpu:5
> [595799.776178] uncore_event_cpu_offline cpu:6
> [595799.780506] uncore_event_cpu_offline cpu:7
> [595799.784862] uncore_event_cpu_offline cpu:8
> [595799.789247] uncore_event_cpu_offline cpu:9
> [595799.793540] intel_uncore_init cpuhp_setup_state failed
> [595799.798776] intel_uncore_init failed
>
>
> Thanks,
> Srinivas
>
>
>
> Registering the CPU hot-plug callback function during booting can be handled
> correctly. I think the problem occurs during runtime.
> The original code may have problems when the CPU hot-plug modifies the
> management CPU during runtime:
> Assume that the CPUs of package 1 are 8-15, and the uncore driver has been
> registered at boot time;
> 1. Offline all CPU No.8-15
> 2. Try online CPU No. 8, the code executes cpumask_set_cpu() successfully, but
> fails in the uncore_freq_add_entry() process. At this time, the mark of CPU No.
> 8 is added to uncore_cpu_mask, but no sys interface is generated,cpu No.8
> online fails;
> 3. Try online CPU No. 8 again, cpumask_any_and() judges success, and the CPU
> No.8 online is successful at this time;
> 4. Assume that the attempt to online CPU No. 9-15 is successful at this time,
> but there is no sys interface ————unexpected behavior 1.
> 5. Offline CPU No. 9-15, and offline No.8, will eventually call
> uncore_freq_remove_die_entry()————unexpected behavior 2 is generated, which may
> cause a crash.
>
>
>
> I see the case in 2 socket server. So good to fix. Thanks for this.
All this extra information that this discussion has brought into light
should be included into the changelog (including the fact that this can
only occur after they were first setup without errors as the it would
have rolled back otherwise).
--
i.
>
> Also I suggest to look why the addition of one entry failed on your server system.
>
> Thanks,
> Srinivas
>
>
>
> > >
> >
> > Fixes tag?
> >
> > > Signed-off-by: shouyeliu <shouyeliu@xxxxxxxxx>
> >
> > The correct format for tags is documented in
> > Documentation/process/5.Posting.rst:
> >
> > tag: Full Name <email address>
>
> ok,I will modify it in the next version
> >
> > > ---
> > > .../x86/intel/uncore-frequency/uncore-frequency.c | 12
> > > ++++++++----
> > > 1 file changed, 8 insertions(+), 4 deletions(-)
> > >
> > > diff --git
> a/drivers/platform/x86/intel/uncore-frequency/uncore-
> > > frequency.c
> b/drivers/platform/x86/intel/uncore-frequency/uncore-
> > > frequency.c
> > > index 40bbf8e45fa4..1de0a4a9d6cd 100644
> > > --- a/drivers/platform/x86/intel/uncore-frequency/uncore-
> > > frequency.c
> > > +++ b/drivers/platform/x86/intel/uncore-frequency/uncore-
> > > frequency.c
> > > @@ -146,15 +146,13 @@ static int
> uncore_event_cpu_online(unsigned
> > > int cpu)
> > > {
> > > struct uncore_data *data;
> > > int target;
> > > + int ret;
> > >
> > > /* Check if there is an online cpu in the package for
> > > uncore MSR */
> > > target = cpumask_any_and(&uncore_cpu_mask,
> > > topology_die_cpumask(cpu));
> > > if (target < nr_cpu_ids)
> > > return 0;
> > >
> > > - /* Use this CPU on this die as a control CPU */
> > > - cpumask_set_cpu(cpu, &uncore_cpu_mask);
> > > -
> > > data = uncore_get_instance(cpu);
> > > if (!data)
> > > return 0;
> > > @@ -163,7 +161,13 @@ static int
> uncore_event_cpu_online(unsigned
> > > int cpu)
> > > data->die_id = topology_die_id(cpu);
> > > data->domain_id = UNCORE_DOMAIN_ID_INVALID;
> > >
> > > - return uncore_freq_add_entry(data, cpu);
> > > + ret = uncore_freq_add_entry(data, cpu);
> > > + if (!ret) {
> > > + /* Use this CPU on this die as a control CPU */
> > > + cpumask_set_cpu(cpu, &uncore_cpu_mask);
> > > + }
> > > +
> > > + return ret;
> >
> > Please reverse to logic such that you return early on error,
> which is
> > the
> > usual error handling pattern.
>
> ok,I will modify it in the next version
> >
> > > }
> > >
> > > static int uncore_event_cpu_offline(unsigned int cpu)
> > >
> >
>
>
>
>