Re: (sashiko review) [RFC PATCH v2] mm/damon/core: fix damon_call() vs kdamond_fn() exit race deadlock

From: SeongJae Park

Date: Thu Mar 26 2026 - 23:54:45 EST


On Thu, 26 Mar 2026 19:20:13 -0700 SeongJae Park <sj@xxxxxxxxxx> wrote:

> On Thu, 26 Mar 2026 19:15:45 -0700 SeongJae Park <sj@xxxxxxxxxx> wrote:
>
> > Forwarding the Sashiko review so that it can be discussed via mail.
> >
> > # review url: https://sashiko.dev/#/patchset/20260327004952.58266-1-sj@xxxxxxxxxx
> > # start of sashiko.dev inline review
> > commit 36b78ee39968830aca3d375219a2cfce73d0d1d0
> > Author: SeongJae Park <sj@xxxxxxxxxx>
> >
> > mm/damon/core: fix damon_call() vs kdamond_fn() exit race deadlock
> >
> > This commit fixes a race condition between damon_call and the
> > kdamond_fn exit path that could result in a deadlock. It introduces a
> > call_controls_obsolete flag to prevent damon_call from waiting
> > indefinitely for a terminating thread.
> >
> > Link: https://lore.kernel.org/20260325141956.87144-1-sj@xxxxxxxxxx
> >
> > > diff --git a/mm/damon/core.c b/mm/damon/core.c
> > > --- a/mm/damon/core.c
> > > +++ b/mm/damon/core.c
> >
> > [ ... ]
> >
> > > @@ -3062,6 +3042,9 @@ static int kdamond_fn(void *data)
> > >  	damon_destroy_targets(ctx);
> > >
> > >  	kfree(ctx->regions_score_histogram);
> > > +	mutex_lock(&ctx->call_controls_lock);
> > > +	ctx->call_controls_obsolete = true;
> > > +	mutex_unlock(&ctx->call_controls_lock);
> > >  	kdamond_call(ctx, true);
> > >  	damos_walk_cancel(ctx);
> >
> > Does damos_walk suffer from the same exit race regression that this patch
> > fixes for damon_call?
> >
> > In kdamond_fn, the thread termination sequence executes damos_walk_cancel
> > and only later sets ctx->kdamond = NULL under the kdamond_lock.
> >
> > damos_walk appears to use the same pattern as the old damon_call where it
> > registers ctx->walk_control, checks damon_is_running (which verifies
> > ctx->kdamond != NULL), and then unconditionally blocks on wait_for_completion.
> >
> > If a user triggers a DAMOS walk in the exact window after damos_walk_cancel
> > has finished but before ctx->kdamond = NULL is executed, wouldn't damos_walk
> > successfully register its control and pass the damon_is_running check?
> >
> > If so, it would then block indefinitely on wait_for_completion because the
> > cancellation phase has already passed and the thread is exiting.
> >
> > Because damos_walk is invoked via state_store with the global
> > damon_sysfs_lock held, could this infinite wait block the entire DAMON sysfs
> > interface for all users on the system?
>
> Yes, and I'm working on it separately, as I also mentioned in my reply [1] to
> the previous version.

I may post the next version as a series of two patches: one for this
damon_call() fix, and the other for the damos_walk() race fix.
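
For reference, the ordering that both fixes rely on can be modeled in user
space roughly as below.  This is only a minimal sketch, not the kernel code:
struct sketch_ctx, sketch_register(), sketch_worker_exit() and the -ECANCELED
return value are hypothetical names and choices made for illustration, and a
pthread mutex stands in for call_controls_lock.  The point it shows is that
the exiting worker sets the obsolete flag under the lock before it cancels the
pending requests, and the requester checks the flag under the same lock before
registering, so no request can be registered after the cancellation pass.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the relevant part of the context. */
struct sketch_ctx {
	pthread_mutex_t lock;	/* plays the role of call_controls_lock */
	bool obsolete;		/* plays the role of call_controls_obsolete */
	int nr_pending;		/* requests registered but not yet handled */
};

#define SKETCH_CTX_INIT { PTHREAD_MUTEX_INITIALIZER, false, 0 }

/*
 * Requester side: register a request only if the worker has not yet declared
 * the queue obsolete.  The error code is an assumption of this sketch.
 */
static int sketch_register(struct sketch_ctx *ctx)
{
	int err = 0;

	pthread_mutex_lock(&ctx->lock);
	if (ctx->obsolete)
		err = -ECANCELED;	/* worker is exiting; do not wait */
	else
		ctx->nr_pending++;
	pthread_mutex_unlock(&ctx->lock);
	return err;
}

/*
 * Worker exit path: mark the queue obsolete first, then cancel whatever was
 * registered before the flag was set.  Returns the number of canceled
 * requests.
 */
static int sketch_worker_exit(struct sketch_ctx *ctx)
{
	int canceled;

	pthread_mutex_lock(&ctx->lock);
	ctx->obsolete = true;
	pthread_mutex_unlock(&ctx->lock);

	/* Late requesters are rejected from here on, so this pass is final. */
	pthread_mutex_lock(&ctx->lock);
	canceled = ctx->nr_pending;
	ctx->nr_pending = 0;
	pthread_mutex_unlock(&ctx->lock);
	return canceled;
}
```

In this model, a request registered before sketch_worker_exit() is canceled
by it, and a request attempted afterwards fails immediately instead of
waiting for a worker that is already gone.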

>
> [1] https://lore.kernel.org/20260326135146.90670-1-sj@xxxxxxxxxx


Thanks,
SJ

[...]