Re: (sashiko review) [RFC PATCH v2] mm/damon/core: fix damon_call() vs kdamond_fn() exit race deadlock
From: SeongJae Park
Date: Thu Mar 26 2026 - 22:22:27 EST
Forwarding the Sashiko review so that it can be discussed over email.
# review url: https://sashiko.dev/#/patchset/20260327004952.58266-1-sj@xxxxxxxxxx
# start of sashiko.dev inline review
commit 36b78ee39968830aca3d375219a2cfce73d0d1d0
Author: SeongJae Park <sj@xxxxxxxxxx>
mm/damon/core: fix damon_call() vs kdamond_fn() exit race deadlock
This commit fixes a race condition between damon_call and the
kdamond_fn exit path that could result in a deadlock. It introduces a
call_controls_obsolete flag to prevent damon_call from waiting
indefinitely for a terminating thread.
Link: https://lore.kernel.org/20260325141956.87144-1-sj@xxxxxxxxxx
> diff --git a/mm/damon/core.c b/mm/damon/core.c
> --- a/mm/damon/core.c
> +++ b/mm/damon/core.c
[ ... ]
> @@ -3062,6 +3042,9 @@ static int kdamond_fn(void *data)
> damon_destroy_targets(ctx);
>
> kfree(ctx->regions_score_histogram);
> + mutex_lock(&ctx->call_controls_lock);
> + ctx->call_controls_obsolete = true;
> + mutex_unlock(&ctx->call_controls_lock);
> kdamond_call(ctx, true);
> damos_walk_cancel(ctx);
Does damos_walk suffer from the same exit race regression that this patch
fixes for damon_call?
In kdamond_fn, the thread termination sequence executes damos_walk_cancel
and only later sets ctx->kdamond = NULL under the kdamond_lock.
damos_walk appears to use the same pattern as the old damon_call where it
registers ctx->walk_control, checks damon_is_running (which verifies
ctx->kdamond != NULL), and then unconditionally blocks on wait_for_completion.
If a user triggers a DAMOS walk in the exact window after damos_walk_cancel
has finished but before ctx->kdamond = NULL is executed, wouldn't damos_walk
successfully register its control and pass the damon_is_running check?
If so, it would then block indefinitely on wait_for_completion because the
cancellation phase has already passed and the thread is exiting.
Because damos_walk is invoked via state_store with the global
damon_sysfs_lock held, could this infinite wait block the entire DAMON sysfs
interface for all users on the system?
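
To make the suspected window concrete, the interleaving described above can
be modeled in userspace as a deterministic, single-threaded sequence of
steps. This is only a sketch under stated assumptions: struct model_ctx and
all model_*() names are hypothetical and do not mirror the real code in
mm/damon/core.c. The walk_obsolete field is likewise an assumption,
imagined by analogy with call_controls_obsolete; the patch does not add it.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct damon_ctx. */
struct model_ctx {
	bool kdamond_alive;	/* models ctx->kdamond != NULL */
	bool walk_registered;	/* models ctx->walk_control != NULL */
	bool walk_completed;	/* models the completion being signalled */
	bool walk_obsolete;	/* assumed flag, like call_controls_obsolete */
};

/* Exit-path step: cancel a walk request only if one is registered now. */
static void model_walk_cancel(struct model_ctx *c)
{
	if (c->walk_registered) {
		c->walk_registered = false;
		c->walk_completed = true;
	}
}

/* Exit-path step: models setting ctx->kdamond = NULL. */
static void model_clear_kdamond(struct model_ctx *c)
{
	c->kdamond_alive = false;
}

/*
 * Caller side: register the request, then check "is running".
 * Returns true if the caller would go on to wait for completion.
 */
static bool model_walk_would_wait(struct model_ctx *c, bool honor_flag)
{
	if (honor_flag && c->walk_obsolete)
		return false;	/* worker is already past its cancel step */
	c->walk_registered = true;
	if (!c->kdamond_alive) {
		c->walk_registered = false;
		return false;
	}
	return true;	/* caller now blocks until walk_completed is set */
}

/* Replays the suspected interleaving; returns true if the caller hangs. */
static bool model_race_hangs(bool use_flag)
{
	struct model_ctx c = { .kdamond_alive = true };
	bool waiting;

	if (use_flag)
		c.walk_obsolete = true;	/* set before the cancel step */
	model_walk_cancel(&c);		/* nothing is registered yet */
	waiting = model_walk_would_wait(&c, use_flag);	/* lands in window */
	model_clear_kdamond(&c);	/* too late for the caller to notice */
	return waiting && !c.walk_completed;
}
```

In this model, model_race_hangs(false) reports a hang because the request
is registered after the cancel step but while the worker still looks alive,
and model_race_hangs(true) does not, because the assumed flag makes the
caller bail out before registering. Whether the real damos_walk behaves
this way is exactly the question raised above.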
# end of sashiko.dev inline review
# review url: https://sashiko.dev/#/patchset/20260327004952.58266-1-sj@xxxxxxxxxx
#
# hkml [1] generated a draft of this mail. It can be regenerated
# using the command below:
#
# hkml patch sashiko_dev --for_forwarding \
# 20260327004952.58266-1-sj@xxxxxxxxxx
#
# [1] https://github.com/sjp38/hackermail
Sent using hkml (https://github.com/sjp38/hackermail)