Re: [PATCH v2 3/3] libkmod-module: add support for a patient module removal option

From: Lucas De Marchi
Date: Thu Sep 23 2021 - 04:52:01 EST


On Mon, Aug 09, 2021 at 10:16:02PM -0700, Luis Chamberlain wrote:
When doing tests with modules such as scsi_debug, on test
frameworks such as fstests [0] you may run into situations
where false positives are triggered and a test fails but
the reality is that the test did not fail, but what did
fail was the removal of the module since the refcnt is
not yet 0. There are many races possible with modules,
and each subsystem would have their own quirks for why
these races exist. Additionally, even after the refcnt is
0, userspace cannot be sure in any way that the refcnt
will remain 0, and so its easy to run into a situation
where the refcnt is bumped from 0 after modprobe just
saw it as 0.

The scsi_debug races are all documented on korg#21233 [1],
some issues are part scsi_debug alone, however other parts
of the race issue are completely due to incomplete described
semantics for modules and the inability for userspace to be
100% sure that a race cannot happen to bump the refcnt
once it observes the refcnt is 0.

I've written a script mod-refcnt-race.sh attached to
korg#214015 [2] which helps us easily reproduce the races
described and abstracts them away from fstests. This can be
used to both verify the issue and also test the new kmod
patient module remover. The same bug report also has a
simple busy-open-block-device-sleep.c which busy opens
a block device, sleep 4 seconds in a loop, closes the
file descriptor, which can easily be used to force failure
of the module removal (delete_module()) and test this
new patient removal.

Although there are patches for fstests to account for
this [3] and work around it, a much suitable solution
long term is for these hacks to use a patient module
remover from kmod and modprobe directly. This patch does
just that, it adds these new arguments to modprobe and
rmmod:

-p | --remove-patiently patiently removes the module
-t | --timeout timeout in ms to remove the module

The patient removal works by polling until the module refcnt
/sys/module/name/refcnt goes to 0. One that does happens it
tries to also patiently remove the module. This second step
is required as a new temporary refcnt bump can occur after
a module refcnt becomes 0. By default modprobe and rmmod
will wait forever, however a timeout can be specified to
in milliseconds to ensure only that amount of time is
spent trying to remove the module.

If a timeout is used it applies cumulatively to both the
refcnt wait, and for trying to remove the module once
modprobe sees the refcnt is already 0.

This new fatures is useful for cases where you know the
refcnt is going to become 0, and it is just a matter of time,
such as with many fstests / blktests tests.

It is worth noting that prior to kernel v3.13 the kernel
module delete system call used to support a wait flag,
this was removed via commit 3f2b9c9cdf389 ("module: remove
rmmod --wait option.") in favor for a 10 second sleep in
kmod. A 10 second sleep does not address the issues, and
so we must implement a patient removal properly in kmod.

The story was not kind like that. It wasn't removed "in favor for a 10
second sleep" in the sense that the sleep would replace the wait.

It was actually for "this wait logic in the kernel is complex and
buggy, let's try to remove it". So we decided to deprecate it and add
a sleep rmmod to see if anyone complained. 1 year later of no complains
we removed it from kernel. This was all after noticing we had never
implemented the wait logic in modprobe - it was only done in rmmod.



[0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
[1] https://bugzilla.kernel.org/show_bug.cgi?id=21233
[2] https://bugzilla.kernel.org/show_bug.cgi?id=214015
[3] https://lkml.kernel.org/r/20210727201045.2540681-1-mcgrof@xxxxxxxxxx
Signed-off-by: Luis Chamberlain <mcgrof@xxxxxxxxxx>
---
libkmod/docs/libkmod-sections.txt | 4 +
libkmod/libkmod-module.c | 328 +++++++++++++++++++++++++++++
libkmod/libkmod.c | 71 +++++++
libkmod/libkmod.h | 7 +
libkmod/libkmod.sym | 4 +
libkmod/python/kmod/_libkmod_h.pxd | 3 +
libkmod/python/kmod/module.pyx | 4 +
man/modprobe.xml | 59 ++++++
man/rmmod.xml | 60 ++++++
tools/modprobe.c | 21 +-
tools/remove.c | 12 +-
tools/rmmod.c | 27 ++-
12 files changed, 584 insertions(+), 16 deletions(-)

diff --git a/libkmod/docs/libkmod-sections.txt b/libkmod/docs/libkmod-sections.txt
index e59ab7a..a5b71d2 100644
--- a/libkmod/docs/libkmod-sections.txt
+++ b/libkmod/docs/libkmod-sections.txt
@@ -12,6 +12,8 @@ kmod_dump_index

kmod_set_log_priority
kmod_get_log_priority
+kmod_get_refcnt_timeout
+kmod_set_refcnt_timeout
kmod_set_log_fn
kmod_get_userdata
kmod_set_userdata
@@ -56,6 +58,7 @@ kmod_module_unref_list
kmod_module_insert_module
kmod_module_probe_insert_module
kmod_module_remove_module
+kmod_module_remove_module_wait

kmod_module_get_module
kmod_module_get_dependencies
@@ -102,5 +105,6 @@ kmod_module_get_initstate
kmod_module_initstate_str
kmod_module_get_size
kmod_module_get_refcnt
+kmod_module_get_refcnt_wait
kmod_module_get_holders
</SECTION>
diff --git a/libkmod/libkmod-module.c b/libkmod/libkmod-module.c
index 04bb4d9..2d54d1a 100644
--- a/libkmod/libkmod-module.c
+++ b/libkmod/libkmod-module.c
@@ -30,6 +30,9 @@
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
+#include <poll.h>
+#include <time.h>
+#include <math.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/syscall.h>
@@ -802,6 +805,143 @@ KMOD_EXPORT int kmod_module_remove_module(struct kmod_module *mod,
return err;
}

+static int timespec_to_ms(struct timespec *t)
+{
+ return (t->tv_sec * 1000) + lround(t->tv_nsec / 1000000);
+}
+
+static int time_delta_ms(struct timespec *before, struct timespec *after)
+{
+ if (!before || !after)
+ return 0;
+ return timespec_to_ms(after) - timespec_to_ms(before);
+}

we have a similar thing in util.[ch]

+
+/**
+ * kmod_module_remove_module_wait:
+ * @mod: kmod module
+ * @flags: flags to pass to Linux kernel when removing the module. The only valid flag is
+ * KMOD_REMOVE_FORCE: force remove module regardless if it's still in
+ * use by a kernel subsystem or other process;
+ * KMOD_REMOVE_NOWAIT is always enforced, causing us to pass O_NONBLOCK to
+ * delete_module(2). We do the waiting in userspace, if a wait was desired.
+ *
+ * Remove a module from Linux kernel patiently.
+ *
+ * Returns: 0 on success or < 0 on failure.
+ */
+KMOD_EXPORT int kmod_module_remove_module_wait(struct kmod_module *mod,
+ unsigned int flags,
+ bool wait)

why do you have kmod_get_refcnt_timeout/kmod_set_refcnt_timeout instead
of just doing s/bool wait/unsigned int wait_msec/)?

+{
+ int err, err_time, t_delta, t_delta_print, orig_timeout;
+ int time_between_prints = 500; /* 1/2 second */
+ int timeout_ms;
+ struct timespec t1, t2, t_print = { .tv_sec = 0, .tv_nsec = 0};
+ bool printed_msg = false;
+
+ if (mod == NULL)
+ return -ENOENT;
+
+ if (wait)
+ timeout_ms = kmod_get_refcnt_timeout(mod->ctx);
+
+ if (!wait || !timeout_ms)
+ return kmod_module_remove_module(mod, flags);
+
+ err_time = clock_gettime(CLOCK_MONOTONIC, &t1);
+ if (err_time != 0)
+ return kmod_module_remove_module(mod, flags);
+
+ /*
+ * We now know an original patient removal was requested and
+ * an associated timeout was also provided. timeout_removal_ms
+ * should be set with what we are left with.
+ */
+ timeout_ms = orig_timeout = kmod_get_removal_timeout(mod->ctx);
+ DBG(mod->ctx, "removal timeout is %d for '%s'\n", timeout_ms, mod->name);
+
+ /*
+ * Filter out other flags and force ONONBLOCK
+ *
+ * This is true even for the patient removal. The wait flag was
+ * a now deprecated flag, see kernel commits:
+ *
+ * 3f2b9c9cdf389e ("module: remove rmmod --wait option." on v3.13
+ * 79465d2fd48e68 ("module: remove warning about waiting module removal.") on v3.15
+ *
+ * Patient removal is a userspace implementation.
+ */
+ flags &= KMOD_REMOVE_FORCE;
+ flags |= KMOD_REMOVE_NOWAIT;
+
+ while (true) {
+ err = delete_module(mod->name, flags);
+ if (err == 0)
+ return 0;
+ err = -errno;
+
+ if (timeout_ms == 0) {
+ ERR(mod->ctx, "timeout (%d) trying to remove module immediately '%s': %m %s\n",
+ timeout_ms, mod->name, strerror(errno));
+ return err;
+ }
+
+ /* Wait forever request */
+ if (timeout_ms < 0) {
+ if (!printed_msg) {
+ NOTICE(mod->ctx, "%s trying forever to remove...\n",
+ mod->name);
+ printed_msg = true;
+ }
+ continue;
+ }
+
+ /* All below is when we know we have a timeout */
+
+ err_time = clock_gettime(CLOCK_MONOTONIC, &t2);
+ if (err_time != 0) {
+ ERR(mod->ctx, "Using CLOCK_MONOTONIC failed, cannot wait, prior module removal module failed for '%s': %m %s\n",
+ mod->name, strerror(errno));
+ return err;
+ }
+
+ t_delta = time_delta_ms(&t1, &t2);
+ DBG(mod->ctx, "%s removal time delta: %d ms\n", mod->name, t_delta);
+
+ if (t_delta >= orig_timeout) {
+ ERR(mod->ctx, "timeout (%d) trying to remove module '%s': %m %s\n",
+ timeout_ms, mod->name, strerror(errno));
+ return err;
+ }
+
+ /*
+ * We have more time to remove the module. The above check
+ * also ensures that timeout_ms won't ever be 0.
+ */
+ timeout_ms = orig_timeout - t_delta;
+
+ if (!printed_msg) {
+ NOTICE(mod->ctx, "%s removal failed. Re-trying for %d more ms (total timeout of %d ms)\n",
+ mod->name, timeout_ms, orig_timeout);
+ t_print = t2;
+ printed_msg = true;
+ } else {
+ /*
+ * Computes the delta between our last print
+ * and the time of our current current
+ * evaluation.
+ */
+ t_delta_print = time_delta_ms(&t_print, &t2);
+ if (t_delta_print < time_between_prints)
+ continue;
+ NOTICE(mod->ctx, "%s removal failed. Re-trying for %d more ms (total timeout of %d ms)\n",
+ mod->name, timeout_ms, orig_timeout);
+ t_print = t2;
+ }
+ }
+}
+
extern long init_module(const void *mem, unsigned long len, const char *args);

/**
@@ -1926,6 +2066,194 @@ KMOD_EXPORT int kmod_module_get_refcnt(const struct kmod_module *mod)
return (int)refcnt;
}

+/**
+ * kmod_module_get_refcnt_wait:
+ * @mod: kmod module
+ * @wait: if true will wait until the refcnt is 0 patiently forever
+ *
+ * Get the ref count of this @mod, as returned by Linux Kernel, by reading
+ * /sys filesystem and wait patiently until the refcnt is 0, if the wait bool
+ * argument is set to true, otherwise this behaves just as the call
+ * kmod_module_get_refcnt(). Enabling wait is useful if you know for sure that
+ * the module is quiescing soon, and so you should be able to remove it soon.
+ * If wait is enabled, we wait a predetermined timeout setting which is
+ * is configurable with kmod_set_refcnt_timeout(), by default we wait
+ * forever.
+ *
+ * Returns: the reference count on success or < 0 on failure.
+ */
+KMOD_EXPORT int kmod_module_get_refcnt_wait(const struct kmod_module *mod, bool wait)
+{
+ struct pollfd fdset;
+ int err, err_time, num_fds=1, t_delta, t_delta_print, orig_timeout;
+ int time_between_prints = 500; /* 1/2 second */
+ int timeout_ms;
+ long refcnt;
+ struct timespec t1, t2, t_print = { .tv_sec = 0, .tv_nsec = 0};
+ bool printed_msg = false;
+
+ if (wait)
+ timeout_ms = kmod_get_refcnt_timeout(mod->ctx);
+
+ if (!wait || !timeout_ms)
+ return kmod_module_get_refcnt(mod);
+
+ orig_timeout = timeout_ms;
+
+ /*
+ * The refcnt may be something like 2, and poll() may return when
+ * the refcnt becomes 1, and so we must also check for the overall
+ * time ran and on each instance make the timeout available shorter
+ *
+ * If using the CLOCK_MONOTONIC failed and we have a timeout set, we
+ * simply cannot wait as we cannot trust time differences.
+ */
+ err_time = clock_gettime(CLOCK_MONOTONIC, &t1);
+ if (err_time != 0) {
+ ERR(mod->ctx, "Using the CLOCK_MONOTONIC failed (%s) so processing removal for %s without a wait for the refcnt\n",
+ strerror(errno), mod->name);
+ return kmod_module_get_refcnt(mod);
+ }
+
+ kmod_set_removal_timeout(mod->ctx, orig_timeout);
+
+ while (true) {
+ DBG(mod->ctx, "%s about to poll for refcnt using timeout %d ms\n", mod->name, timeout_ms);
+ memset(&fdset, 0, sizeof(struct pollfd));
+ fdset.events = POLLPRI | POLLIN;
+ fdset.fd = kmod_module_get_refcnt_fd(mod);
+ if (fdset.fd < 0)
+ return fdset.fd;
+ err = poll(&fdset, num_fds, timeout_ms);
+ if (err == -1) {
+ DBG(mod->ctx, "%s refcnt poll failed with err: %s\n", mod->name, strerror(errno));
+ goto err;
+ }
+ if (err == 0) { /* timeout before fds were read */
+ NOTICE(mod->ctx, "%s refcnt did not become 0 after %d ms\n",
+ mod->name, orig_timeout);
+ kmod_set_removal_timeout(mod->ctx, 0);
+ err = -ETIMEDOUT;
+ goto err;
+ }
+ if (err != num_fds) { /* on success, we only asked for one fd */
+ err = -EINVAL;
+ goto err;
+ }
+ if (!fdset.revents) {
+ close(fdset.fd);
+ continue;
+ }
+ if ((!fdset.revents & POLLIN)) {
+ err = -EINVAL;
+ goto err;
+ }
+
+ err = read_str_long(fdset.fd, &refcnt, 10);
+ if (err < 0)
+ goto err;
+
+ close(fdset.fd);
+
+ if ((refcnt <= 0) || (refcnt > 0 && !wait)) {
+ NOTICE(mod->ctx, "%s refcnt is %d\n", mod->name, (int) refcnt);
+ err_time = clock_gettime(CLOCK_MONOTONIC, &t2);
+ if (err_time != 0)
+ kmod_set_removal_timeout(mod->ctx, 0);

I don't follow why kmod_module_get_refcnt_wait() is setting the removal
timeout at all. This seems to be doing it behind users back.

The idea of using the refcnt fd was actually that then
users could integrate it on their mainloops (probably using epoll). And
then the same impl could be shared by kmod_module_remove_module_wait(),
which would do a select().

This seems more like a kmod_module_refcnt_wait_zero() using poll()
+ adjusting the timeout

...

diff --git a/tools/rmmod.c b/tools/rmmod.c
index 3942e7b..77c813a 100644
--- a/tools/rmmod.c
+++ b/tools/rmmod.c
@@ -35,10 +35,12 @@
#define DEFAULT_VERBOSE LOG_ERR
static int verbose = DEFAULT_VERBOSE;
static int use_syslog;
+static int do_remove_patient;

-static const char cmdopts_s[] = "fsvVwh";
+static const char cmdopts_s[] = "fpsvVwh";
static const struct option cmdopts[] = {
{"force", no_argument, 0, 'f'},
+ {"remove-patiently", no_argument, 0, 'p'},
{"syslog", no_argument, 0, 's'},
{"verbose", no_argument, 0, 'v'},
{"version", no_argument, 0, 'V'},
@@ -51,13 +53,14 @@ static void help(void)
printf("Usage:\n"
"\t%s [options] modulename ...\n"
"Options:\n"
- "\t-f, --force forces a module unload and may crash your\n"
- "\t machine. This requires Forced Module Removal\n"
- "\t option in your kernel. DANGEROUS\n"
- "\t-s, --syslog print to syslog, not stderr\n"
- "\t-v, --verbose enables more messages\n"
- "\t-V, --version show version\n"
- "\t-h, --help show this help\n",
+ "\t-f, --force forces a module unload and may crash your\n"
+ "\t machine. This requires Forced Module Removal\n"
+ "\t option in your kernel. DANGEROUS\n"
+ "\t-p, --remove-patiently patiently tries to remove the module\n"
+ "\t-s, --syslog print to syslog, not stderr\n"
+ "\t-v, --verbose enables more messages\n"
+ "\t-V, --version show version\n"
+ "\t-h, --help show this help\n",
program_invocation_short_name);
}

@@ -93,7 +96,7 @@ static int check_module_inuse(struct kmod_module *mod) {
return -EBUSY;
}

- ret = kmod_module_get_refcnt(mod);
+ ret = kmod_module_get_refcnt_wait(mod, do_remove_patient);

for tool implementation, shouldn't we just ignore
kmod_module_get_refcnt() and proceed to
kmod_module_remove_module_wait()?

if (ret > 0) {
ERR("Module %s is in use\n", kmod_module_get_name(mod));
return -EBUSY;
@@ -120,6 +123,9 @@ static int do_rmmod(int argc, char *argv[])
case 'f':
flags |= KMOD_REMOVE_FORCE;
break;
+ case 'p':
+ do_remove_patient = 1;
+ break;
case 's':
use_syslog = 1;
break;
@@ -178,7 +184,8 @@ static int do_rmmod(int argc, char *argv[])
goto next;
}

- err = kmod_module_remove_module(mod, flags);
+ err = kmod_module_remove_module_wait(mod, flags,
+ do_remove_patient);
if (err < 0) {
ERR("could not remove module %s: %s\n", arg,
strerror(-err));
--
2.30.2