Re: [PATCH] platform/x86/intel-uncore-freq: fix inconsistent state on init failure
From: srinivas pandruvada
Date: Tue Apr 15 2025 - 18:11:09 EST
On Tue, 2025-04-15 at 18:10 +0300, Ilpo Järvinen wrote:
> On Tue, 15 Apr 2025, srinivas pandruvada wrote:
>
> > On Tue, 2025-04-15 at 14:39 +0800, liu shouye wrote:
> >
> >
> > srinivas pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>
> > 于2025年4月15日周二
> > 00:08写道:
> > On Mon, 2025-04-14 at 13:41 +0300, Ilpo Järvinen wrote:
> > > On Mon, 14 Apr 2025, shouyeliu wrote:
> > >
> > > > When uncore_event_cpu_online() fails to initialize a
> > control
> > CPU
> > > > (e.g.,
> > > > due to memory allocation failure or
> > uncore_freq_add_entry()
> > > > errors),
> > > > the code leaves stale entries in uncore_cpu_mask after
> > that
> > online
> > > > CPU
> > > > will not try to call uncore_freq_add_entry, resulting in
> > no sys
> > > > interface.
> > >
> > > Please add () after any name that refers to a C function
> > (you're
> > not
> > > even
> > > being consistent here as you had it in some cases but not
> > here).
> >
> > ok,I will modify it in the next version
> >
> > >
> > > Please try to split the very long sentence a bit and make
> > it more
> > > obvious
> > > what causes what as the current wording is a bit vague, did
> > you
> > mean:
> > > uncore_event_cpu_online() will not call
> > uncore_freq_add_entry()
> > for
> > > another CPU that is being onlined or something along those
> > lines?
> > >
> > > Will this change work/matter?
> > Documentation/core-api/cpu_hotplug.rst
> > > says
> > > about cpuhp_setup_state():
> > >
> > > "If a callback fails for CPU N then the teardown callback
> > for CPU
> > > 0 .. N-1 is invoked to rollback the operation. The state
> > setup
> > > fails,
> > > the callbacks for the state are not installed and in case
> > of
> > dynamic
> > > allocation the allocated state is freed."
> > >
> >
> > Yes, cpuhp_setup_state() will fail and which will result in clean
> > up.
> > So any fail of any fail uncore_event_cpu_online() will result in no
> > sys
> > entries.
> >
> > I think here the intention is to keep sys entries, which will not
> > happen with this patch.
> >
> > For confirmation on 6.14 kernel, I forced failure on CPU 10:
> >
> > [595799.696873] intel_uncore_init
> > [595799.700102] uncore_event_cpu_online cpu:0
> > [595799.704240] uncore_event_cpu_online cpu:1
> > [595799.708360] uncore_event_cpu_online cpu:2
> > [595799.712505] uncore_event_cpu_online cpu:3
> > [595799.716633] uncore_event_cpu_online cpu:4
> > [595799.720755] uncore_event_cpu_online cpu:5
> > [595799.724953] uncore_event_cpu_online cpu:6
> > [595799.729158] uncore_event_cpu_online cpu:7
> > [595799.733409] uncore_event_cpu_online cpu:8
> > [595799.737674] uncore_event_cpu_online cpu:9
> > [595799.741954] uncore_event_cpu_online cpu:10
> > [595799.746134] Force CPU 10 to fail online
> > [595799.750182] uncore_event_cpu_offline cpu:0
> > [595799.754508] uncore_event_cpu_offline cpu:1
> > [595799.758834] uncore_event_cpu_offline cpu:2
> > [595799.763238] uncore_event_cpu_offline cpu:3
> > [595799.767558] uncore_event_cpu_offline cpu:4
> > [595799.771832] uncore_event_cpu_offline cpu:5
> > [595799.776178] uncore_event_cpu_offline cpu:6
> > [595799.780506] uncore_event_cpu_offline cpu:7
> > [595799.784862] uncore_event_cpu_offline cpu:8
> > [595799.789247] uncore_event_cpu_offline cpu:9
> > [595799.793540] intel_uncore_init cpuhp_setup_state failed
> > [595799.798776] intel_uncore_init failed
> >
> >
> > Thanks,
> > Srinivas
> >
> >
> >
> > Registering the CPU hot-plug callback function during booting can
> > be handled
> > correctly. I think the problem occurs during runtime.
> > The original code may have problems when the CPU hot-plug modifies
> > the
> > management CPU during runtime:
> > Assume that the CPUs of package 1 are 8-15, and the uncore driver
> > has been
> > registered at boot time;
> > 1. Offline all CPU No.8-15
> > 2. Try online CPU No. 8, the code executes cpumask_set_cpu()
> > successfully, but
> > fails in the uncore_freq_add_entry() process. At this time, the
> > mark of CPU No.
> > 8 is added to uncore_cpu_mask, but no sys interface is
> > generated,cpu No.8
> > online fails;
> > 3. Try online CPU No. 8 again, cpumask_any_and() judges success,
> > and the CPU
> > No.8 online is successful at this time;
> > 4. Assume that the attempt to online CPU No. 9-15 is successful at
> > this time,
> > but there is no sys interface ————unexpected behavior 1.
> > 5. Offline CPU No. 9-15, and offline No.8, will eventually call
> > uncore_freq_remove_die_entry()————unexpected behavior 2 is
> > generated, which may
> > cause a crash.
> >
> >
> >
> > I see the case in 2 socket server. So good to fix. Thanks for this.
>
> All this extra information that this discussion has brought into
> light
> should be included into the changelog (including the fact that this
> can
> only occur after they were first setup without errors as the it would
> have rolled back otherwise).
>
May be something like this:
"
Fix missing uncore sysfs during CPU hotplug
In certain situations, the sysfs for uncore may not be present when all
CPUs in a package are offlined and then brought back online after boot.
This issue can occur if there is an error in adding the sysfs entry due
to a memory allocation failure. Retrying to bring the CPUs online will
not resolve the issue, as the uncore_cpu_mask is already set for the
package before the failure condition occurs.
This issue does not occur if the failure happens during module
initialization, as the module will fail to load in the event of any
error.
To address this, ensure that the uncore_cpu_mask is not set until the
successful return of uncore_freq_add_entry().
"
Thanks,
Srinivas