On Thu, Dec 19 2024 at 10:06, Waiman Long wrote:Thanks for the pointer, will trim down the backtrace.
Depending on the type of panics, it was found that theCan you please trim backtraces properly according to documentation:
__register_nmi_handler() function can be called in NMI context from
nmi_shootdown_cpus() leading to a lockdep splat like the following.
[ 1123.133573] ================================
[ 1123.137845] WARNING: inconsistent lock state
[ 1123.142118] 6.12.0-31.el10.x86_64+debug #1 Not tainted
[ 1123.147257] --------------------------------
[ 1123.151529] inconsistent {INITIAL USE} -> {IN-NMI} usage.
:
[ 1123.261544] Possible unsafe locking scenario:
[ 1123.261544]
[ 1123.267463] CPU0
[ 1123.269915] ----
[ 1123.272368] lock(&nmi_desc[0].lock);
[ 1123.276122] <Interrupt>
[ 1123.278746] lock(&nmi_desc[0].lock);
[ 1123.282671]
[ 1123.282671] *** DEADLOCK ***
:
[ 1123.314088] Call Trace:
[ 1123.316542] <NMI>
[ 1123.318562] dump_stack_lvl+0x6f/0xb0
[ 1123.322230] print_usage_bug.part.0+0x3d3/0x610
[ 1123.330618] lock_acquire.part.0+0x2e6/0x360
[ 1123.357217] _raw_spin_lock_irqsave+0x46/0x90
[ 1123.366193] __register_nmi_handler+0x8f/0x3a0
[ 1123.374401] nmi_shootdown_cpus+0x95/0x120
[ 1123.378509] kdump_nmi_shootdown_cpus+0x15/0x20
[ 1123.383040] native_machine_crash_shutdown+0x54/0x160
[ 1123.388095] __crash_kexec+0x10f/0x1f0
[ 1123.421465] ? __ghes_panic.cold+0x4f/0x5d
[ 1123.482648] </NMI>
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#backtraces-in-commit-messages
It is because I am not sure if there can only be one instance of nmi_shootdown_cpus() at any time. If there can't be more than one instance, I can remove the atomic instruction. I do remove the smp_wmb() in nmi_shootdown_cpus(). If I don't use an atomic instruction, I will have to add it smp_wmb() here.In this particular case, the following panic message was printed before.Fix it by doing ....
One way to address this problem is to remove all the panic() calls from
NMI context, but that can be too restrictive.
Another way to fix this problem while allowing panic() calls from
NMI context is by adding a new emergency NMI handler to the nmi_descThis tail comment is not only useless
structure and provide a new set_emergency_nmi_handler() helper to
atomically set crash_nmi_callback() in any context. The new emergency
handler will preempt other handlers in the linked list. That will
eliminate the need to take any lock and serve the panic in NMI use case.
+/*
+ * An emergency handler can be set in any context including NMI
+ */
struct nmi_desc {
raw_spinlock_t lock;
+ nmi_handler_t emerg_handler; /* Emergency handler */
+/**What's the point of this cmpxchg()? What's the concurrency problem this
+ * set_emergency_nmi_handler - Set emergency handler
+ * @type: NMI type
+ * @handler: the emergency handler to be stored
+ * Return: 0 if success, -EEXIST if a handler had been stored
+ *
+ * Atomically set an emergency NMI handler which, if set, will preempt all
+ * the other handlers in the linked list. If a NULL handler is passed in,
+ * it will clear it.
+ */
+int set_emergency_nmi_handler(unsigned int type, nmi_handler_t handler)
+{
+ struct nmi_desc *desc = nmi_to_desc(type);
+ nmi_handler_t orig = NULL;
+
+ if (!handler) {
+ orig = READ_ONCE(desc->emerg_handler);
+ WARN_ON_ONCE(!orig);
+ }
+
+ if (try_cmpxchg(&desc->emerg_handler, &orig, handler))
+ return 0;
tries to address?
+ if (WARN_ON_ONCE(orig == handler))These return values are there to be ignored at the only call site. So
+ return 0;
+ WARN_ONCE(1, "%s: failed to set emergency NMI handler!\n", __func__);
+ return -EEXIST;
what's the point of having them in the first place?