Re: [PATCH] platform/x86/intel-uncore-freq: fix inconsistent state on init failure
From: liu shouye
Date: Tue Apr 15 2025 - 22:34:27 EST
srinivas pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx> 于2025年4月16日周三 06:11写道:
>
> On Tue, 2025-04-15 at 18:10 +0300, Ilpo Järvinen wrote:
> > On Tue, 15 Apr 2025, srinivas pandruvada wrote:
> >
> > > On Tue, 2025-04-15 at 14:39 +0800, liu shouye wrote:
> > >
> > >
> > > srinivas pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx>
> > > 于2025年4月15日周二
> > > 00:08写道:
> > > On Mon, 2025-04-14 at 13:41 +0300, Ilpo Järvinen wrote:
> > > > On Mon, 14 Apr 2025, shouyeliu wrote:
> > > >
> > > > > When uncore_event_cpu_online() fails to initialize a
> > > control
> > > CPU
> > > > > (e.g.,
> > > > > due to memory allocation failure or
> > > uncore_freq_add_entry()
> > > > > errors),
> > > > > the code leaves stale entries in uncore_cpu_mask after
> > > that
> > > online
> > > > > CPU
> > > > > will not try to call uncore_freq_add_entry, resulting in
> > > no sys
> > > > > interface.
> > > >
> > > > Please add () after any name that refers to a C function
> > > (you're
> > > not
> > > > even
> > > > being consistent here as you had it in some cases but not
> > > here).
> > >
> > > ok,I will modify it in the next version
> > >
> > > >
> > > > Please try to split the very long sentence a bit and make
> > > it more
> > > > obvious
> > > > what causes what as the current wording is a bit vague, did
> > > you
> > > mean:
> > > > uncore_event_cpu_online() will not call
> > > uncore_freq_add_entry()
> > > for
> > > > another CPU that is being onlined or something along those
> > > lines?
> > > >
> > > > Will this change work/matter?
> > > Documentation/core-api/cpu_hotplug.rst
> > > > says
> > > > about cpuhp_setup_state():
> > > >
> > > > "If a callback fails for CPU N then the teardown callback
> > > for CPU
> > > > 0 .. N-1 is invoked to rollback the operation. The state
> > > setup
> > > > fails,
> > > > the callbacks for the state are not installed and in case
> > > of
> > > dynamic
> > > > allocation the allocated state is freed."
> > > >
> > >
> > > Yes, cpuhp_setup_state() will fail and which will result in clean
> > > up.
> > > So any fail of any fail uncore_event_cpu_online() will result in no
> > > sys
> > > entries.
> > >
> > > I think here the intention is to keep sys entries, which will not
> > > happen with this patch.
> > >
> > > For confirmation on 6.14 kernel, I forced failure on CPU 10:
> > >
> > > [595799.696873] intel_uncore_init
> > > [595799.700102] uncore_event_cpu_online cpu:0
> > > [595799.704240] uncore_event_cpu_online cpu:1
> > > [595799.708360] uncore_event_cpu_online cpu:2
> > > [595799.712505] uncore_event_cpu_online cpu:3
> > > [595799.716633] uncore_event_cpu_online cpu:4
> > > [595799.720755] uncore_event_cpu_online cpu:5
> > > [595799.724953] uncore_event_cpu_online cpu:6
> > > [595799.729158] uncore_event_cpu_online cpu:7
> > > [595799.733409] uncore_event_cpu_online cpu:8
> > > [595799.737674] uncore_event_cpu_online cpu:9
> > > [595799.741954] uncore_event_cpu_online cpu:10
> > > [595799.746134] Force CPU 10 to fail online
> > > [595799.750182] uncore_event_cpu_offline cpu:0
> > > [595799.754508] uncore_event_cpu_offline cpu:1
> > > [595799.758834] uncore_event_cpu_offline cpu:2
> > > [595799.763238] uncore_event_cpu_offline cpu:3
> > > [595799.767558] uncore_event_cpu_offline cpu:4
> > > [595799.771832] uncore_event_cpu_offline cpu:5
> > > [595799.776178] uncore_event_cpu_offline cpu:6
> > > [595799.780506] uncore_event_cpu_offline cpu:7
> > > [595799.784862] uncore_event_cpu_offline cpu:8
> > > [595799.789247] uncore_event_cpu_offline cpu:9
> > > [595799.793540] intel_uncore_init cpuhp_setup_state failed
> > > [595799.798776] intel_uncore_init failed
> > >
> > >
> > > Thanks,
> > > Srinivas
> > >
> > >
> > >
> > > Registering the CPU hot-plug callback function during booting can
> > > be handled
> > > correctly. I think the problem occurs during runtime.
> > > The original code may have problems when the CPU hot-plug modifies
> > > the
> > > management CPU during runtime:
> > > Assume that the CPUs of package 1 are 8-15, and the uncore driver
> > > has been
> > > registered at boot time;
> > > 1. Offline all CPU No.8-15
> > > 2. Try online CPU No. 8, the code executes cpumask_set_cpu()
> > > successfully, but
> > > fails in the uncore_freq_add_entry() process. At this time, the
> > > mark of CPU No.
> > > 8 is added to uncore_cpu_mask, but no sys interface is
> > > generated,cpu No.8
> > > online fails;
> > > 3. Try online CPU No. 8 again, cpumask_any_and() judges success,
> > > and the CPU
> > > No.8 online is successful at this time;
> > > 4. Assume that the attempt to online CPU No. 9-15 is successful at
> > > this time,
> > > but there is no sys interface ————unexpected behavior 1.
> > > 5. Offline CPU No. 9-15, and offline No.8, will eventually call
> > > uncore_freq_remove_die_entry()————unexpected behavior 2 is
> > > generated, which may
> > > cause a crash.
> > >
> > >
> > >
> > > I see the case in 2 socket server. So good to fix. Thanks for this.
> > >
> > > Also I suggest to look why the addition of one entry failed on your server system.
Sorry for my description problem. My machine also has two sockets.
After all, a one-socket machine cannot offline all CPUs.
> >
> > All this extra information that this discussion has brought into
> > light
> > should be included into the changelog (including the fact that this
> > can
> > only occur after they were first setup without errors as the it would
> > have rolled back otherwise).
Got it. My previous commit was indeed not described clearly.
Thanks,
Shouye
> >
>
> May be something like this:
>
> "
> Fix missing uncore sysfs during CPU hotplug
>
> In certain situations, the sysfs for uncore may not be present when all
> CPUs in a package are offlined and then brought back online after boot.
>
> This issue can occur if there is an error in adding the sysfs entry due
> to a memory allocation failure. Retrying to bring the CPUs online will
> not resolve the issue, as the uncore_cpu_mask is already set for the
> package before the failure condition occurs.
>
> This issue does not occur if the failure happens during module
> initialization, as the module will fail to load in the event of any
> error.
>
> To address this, ensure that the uncore_cpu_mask is not set until the
> successful return of uncore_freq_add_entry().
> "
>
> Thanks,
> Srinivas
>
Very clear and comprehensive commit description - I will apply it in
the next version.
Thanks,
Shouye