Re: [PATCH] kexec: change locking mechanism to a mutex

From: Baoquan He
Date: Fri Sep 22 2023 - 04:07:45 EST


On 09/22/23 at 11:36am, Dave Young wrote:
> [Cced Valentin Schneider as he added the trylocks]
>
> On Fri, 22 Sept 2023 at 06:04, Eric DeVolder <eric.devolder@xxxxxxxxxx> wrote:
> >
> > Scaled up testing has revealed that the kexec_trylock()
> > implementation leads to failures within the crash hotplug
> > infrastructure due to the inability to acquire the lock,
> > specifically the message:
> >
> > crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
> >
> > When hotplug events occur, the crash hotplug infrastructure first
> > attempts to obtain the lock via the kexec_trylock(). However, the
> > implementation either acquires the lock, or fails and returns; there
> > is no waiting on the lock. Here is the comment/explanation from
> > kernel/kexec_internal.h:kexec_trylock():
> >
> > * Whatever is used to serialize accesses to the kexec_crash_image needs to be
> > * NMI safe, as __crash_kexec() can happen during nmi_panic(), so here we use a
> > * "simple" atomic variable that is acquired with a cmpxchg().
> >
> > While this in theory can happen for either CPU or memory hoptlug,
> > this problem is most prone to occur for memory hotplug.
> >
> > When memory is hot plugged, the memory is converted into smaller
> > 128MiB memblocks (typically). As each memblock is processed, a
> > kernel thread and a udev event thread are created. The udev thread
> > tries for the lock via the reading of the sysfs node
> > /sys/devices/system/memory/crash_hotplug node, and the kernel
> > worker thread tries for the lock upon entering the crash hotplug
> > infrastructure.
> >
> > These threads then compete for the kexec lock.
> >
> > For example, a 1GiB DIMM is converted into 8 memblocks, each
> > spawning two threads for a total of 16 threads that create a small
> > "swarm" all trying to acquire the lock. The larger the DIMM, the
> > more the memblocks and the larger the swarm.
> >
> > At the root of the problem is the atomic lock behind kexec_trylock();
> > it works well for low lock traffic; ie loading/unloading a capture
> > kernel, things that happen basically once. But with the introduction
> > of crash hotplug, the traffic through the lock increases significantly,
> > and more importantly in bursts occurring at roughly the same time. Thus
> > there is a need to wait on the lock.

Yeah, the atomic __kexec_lock is used to lock the door of operation on
kimage. Among kexec/kdump kernel load/unload/shrink/jumping, once any one
is in progress, the later attempt doesn't make sense. And these events are
rare.

Crash hotplug event is different, there will be many during one period.
The main problem you are encountering is the cocurrent handling of hotplug
event, right? Wondering if we can define another mutex lock to serialize
the handling of hotplug event like below. Just a sterotype to state my
thought.

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 03a7932cde0a..39b9a57a4177 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -783,6 +783,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
{
struct kimage *image;

+ crash_hotplug_lock();
/* Obtain lock while changing crash information */
if (!kexec_trylock()) {
pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
@@ -852,6 +853,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu)
out:
/* Release lock now that update complete */
kexec_unlock();
+ crash_hotplug_unlock();
}

static int crash_memhp_notifier(struct notifier_block *nb, unsigned long val, void *v)
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 9dc728982d79..b95a73f35d9a 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -48,6 +48,7 @@
#include "kexec_internal.h"

atomic_t __kexec_lock = ATOMIC_INIT(0);
+DEFINE_MUTEX(__crash_hotplug_lock);

/* Flag to indicate we are going to kexec a new kernel */
bool kexec_in_progress = false;
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 74da1409cd14..32cb890bb059 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -28,6 +28,10 @@ static inline void kexec_unlock(void)
atomic_set_release(&__kexec_lock, 0);
}

+extern struct mutex __crash_hotplug_lock;
+#define crash_hotplug_lock() mutex_lock(&__crash_hotplug_lock)
+#define crash_hotplug_unlock() mutex_unlock(&__crash_hotplug_lock)
+
#ifdef CONFIG_KEXEC_FILE
#include <linux/purgatory.h>
void kimage_file_post_load_cleanup(struct kimage *image);