Re: [PATCH 2/5] cgroup/dmem: Add reclaim callback for lowering max below current usage

From: Thomas Hellström

Date: Wed Apr 22 2026 - 06:26:21 EST

On Wed, 2026-04-22 at 11:50 +0200, Maarten Lankhorst wrote:
> Hey,
>
> Den 2026-04-22 kl. 10:42, skrev Thomas Hellström:
> > On Wed, 2026-04-22 at 10:31 +0200, Maarten Lankhorst wrote:
> > > Hey,
> > >
> > > (Adding Thadeu to cc since they've been working on the same
> > > issue)
> > >
> > > Den 2026-03-27 kl. 09:15, skrev Thomas Hellström:
> > > > Add an optional reclaim callback to struct dmem_cgroup_region. When
> > > > dmem.max is set below current usage, invoke the callback to evict
> > > > memory and retry setting the limit rather than failing immediately.
> > > > Signal interruptions propagate back to the write() caller.
> > > >
> > > > RFC:
> > > > Due to us updating the max limit _after_ the usage has been
> > > > sufficiently lowered, this should be prone to failures if there are
> > > > aggressive allocators running in parallel to the reclaim.
> > > > So can we somehow enforce the new limit while the eviction is
> > > > happening?
> > > >
> > > > Assisted-by: GitHub Copilot:claude-sonnet-4.6
> > > > Signed-off-by: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > > > ---
> > > > include/linux/cgroup_dmem.h | 11 +++++
> > > > kernel/cgroup/dmem.c        | 94 +++++++++++++++++++++++++++++++++----
> > > > 2 files changed, 96 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
> > > > index dd4869f1d736..61520a431740 100644
> > > > --- a/include/linux/cgroup_dmem.h
> > > > +++ b/include/linux/cgroup_dmem.h
> > > > @@ -26,6 +26,10 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
> > > >       bool ignore_low, bool *ret_hit_low);
> > > >
> > > > void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
> > > > +void dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region *region,
> > > > +     int (*reclaim)(struct dmem_cgroup_pool_state *pool,
> > > > +    u64 target_bytes, void *priv),
> > > > +     void *priv);
> > > > #else
> > > > static inline __printf(2,3) struct dmem_cgroup_region *
> > > > dmem_cgroup_register_region(u64 size, const char *name_fmt, ...)
> > > > @@ -62,5 +66,12 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
> > > > static inline void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool)
> > > > { }
> > > >
> > > > +static inline void
> > > > +dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region *region,
> > > > +        int (*reclaim)(struct dmem_cgroup_pool_state *pool,
> > > > +       u64 target_bytes, void *priv),
> > > > +        void *priv)
> > > > +{ }
> > > > +
> > > > #endif
> > > > #endif /* _CGROUP_DMEM_H */
> > > > diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
> > > > index 3e6d4c0b26a1..f993fb058b74 100644
> > > > --- a/kernel/cgroup/dmem.c
> > > > +++ b/kernel/cgroup/dmem.c
> > > > @@ -51,6 +51,18 @@ struct dmem_cgroup_region {
> > > > * No new pools should be added to the region afterwards.
> > > > */
> > > > bool unregistered;
> > > > +
> > > > + /**
> > > > + * @reclaim: Optional callback invoked when dmem.max is set below the
> > > > + * current usage of a pool. The driver should attempt to free at least
> > > > + * @target_bytes from @pool. May be called multiple times if usage
> > > > + * remains above the limit after returning.
> > > > + */
> > > > + int (*reclaim)(struct dmem_cgroup_pool_state *pool, u64 target_bytes,
> > > > +        void *priv);
> > > > +
> > > > + /** @reclaim_priv: Private data passed to @reclaim. */
> > > > + void *reclaim_priv;
> > > > };
> > > >
> > > > struct dmemcg_state {
> > > > @@ -145,23 +157,59 @@ static void free_cg_pool(struct dmem_cgroup_pool_state *pool)
> > > > }
> > > >
> > > > static int
> > > > -set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val)
> > > > +set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val,
> > > > +		 struct dmem_cgroup_region *region)
> > > > {
> > > > 	page_counter_set_min(&pool->cnt, val);
> > > > 	return 0;
> > > > }
> > > >
> > > > static int
> > > > -set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
> > > > +set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val,
> > > > +		 struct dmem_cgroup_region *region)
> > > > {
> > > > 	page_counter_set_low(&pool->cnt, val);
> > > > 	return 0;
> > > > }
> > > >
> > > > static int
> > > > -set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
> > > > +set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val,
> > > > +		 struct dmem_cgroup_region *region)
> > > > {
> > > > -	return page_counter_set_max(&pool->cnt, val);
> > > > +	int err = page_counter_set_max(&pool->cnt, val);
> > > > +
> > > > +	if (err != -EBUSY || !region || !region->reclaim)
> > > > +		return err;
> > > > +
> > > > +	/*
> > > > +	 * The new max is below current usage. Ask the driver to evict memory
> > > > +	 * and retry, up to a bounded number of times. Signal interruptions are
> > > > +	 * propagated back to the write() caller; other reclaim failures leave
> > > > +	 * -EBUSY as the result.
> > > > +	 */
> > > > +	for (int retries = 5; retries > 0; retries--) {
> > > > +		u64 usage = page_counter_read(&pool->cnt);
> > > > +		u64 target = usage > val ? usage - val : 0;
> > > > +		int reclaim_err;
> > > > +
> > > > +		if (!target) {
> > > > +			err = page_counter_set_max(&pool->cnt, val);
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		reclaim_err = region->reclaim(pool, target, region->reclaim_priv);
> > > > +		if (reclaim_err) {
> > > > +			if (reclaim_err == -EINTR || reclaim_err == -ERESTARTSYS)
> > > > +				err = reclaim_err;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		err = page_counter_set_max(&pool->cnt, val);
> > > > +		if (err != -EBUSY)
> > > > +			break;
> > > > +	}
> > > > +
> > > > +	return err;
> > > > }
> > >
> > > I mentioned this in chat but I wanted to mention it on the mailing
> > > list for others as well,
> > > can we reproduce the behavior from memory_max_write() in
> > > mm/memcontrol.c?
> > >
> > > 1. First set new limit through xchg.
> > > 2. If O_NONBLOCK is set -> do nothing, next allocation in target
> > >    region will fail and cause reclaim.
> > > 3. If not set -> reclaim until below new limit or interrupted by a
> > >    signal, return success in all cases here since we set new limit.
> > >
> > >
> >
> > Yup.
> >
> > For 3, we also need to consider the case where we fail to reclaim due
> > to memory being pinned. If it's OK to (usually temporarily) have
> > current usage above max, that would work.
> >
> > I have that coded up and have also added a patch on top to defer
> > reclaim to a thread if we bail due to signal or O_NONBLOCK. Perhaps we
> > could discuss whether that's a good or bad idea in that patch.
>
> That doesn't sound like a good idea. The semantics of O_NONBLOCK
> are deliberately intended to be able to change the max without causing
> reclaim.
>
> See the details in commit ("memcg: introduce non-blocking limit
> setting option")

From reading the docs that commit introduces, it sounds more like it
avoids *synchronous* reclaim, which is also in line with O_NONBLOCK
semantics.

The analogy with launching a thread would be more that of kswapd doing
the reclaim in the memcg case?

But OTOH, if we were to introduce a thread-driven dmem reclaim, that
would perhaps be something tied not directly to the dmem controller
but rather to the dmem provider itself (TTM in this case).
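
To make the three steps quoted above concrete, here is a rough user-space
model of how I read them. It is only a sketch: every name in it
(model_pool, model_reclaim, model_try_charge, model_set_max) is made up
for illustration and none of it is kernel API, the signal check is reduced
to a flag, and the "pinned" field is my stand-in for memory the provider
cannot evict.

```c
/*
 * User-space model only. All names are hypothetical; this is not the
 * dmem or memcg implementation, just the flow from steps 1-3 above.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_pool {
	uint64_t usage;   /* bytes currently charged */
	uint64_t max;     /* current limit */
	uint64_t pinned;  /* bytes that reclaim cannot evict (<= usage) */
};

/* Stand-in for a provider reclaim callback: returns the bytes freed. */
static uint64_t model_reclaim(struct model_pool *pool, uint64_t target)
{
	uint64_t evictable, freed;

	assert(pool->pinned <= pool->usage);
	evictable = pool->usage - pool->pinned;
	freed = target < evictable ? target : evictable;
	pool->usage -= freed;
	return freed;
}

/* Charges are checked against max, so a lowered max applies at once. */
static bool model_try_charge(struct model_pool *pool, uint64_t bytes)
{
	if (pool->usage + bytes > pool->max)
		return false;
	pool->usage += bytes;
	return true;
}

static int model_set_max(struct model_pool *pool, uint64_t new_max,
			 bool nonblock, bool signal_pending)
{
	/* Step 1: publish the new limit before doing anything else. */
	pool->max = new_max;

	/* Step 2: O_NONBLOCK means no synchronous reclaim here. */
	if (nonblock)
		return 0;

	/* Step 3: reclaim until at or below the limit, or give up. */
	while (pool->usage > pool->max) {
		if (signal_pending)
			break;
		if (!model_reclaim(pool, pool->usage - pool->max))
			break;	/* nothing evictable left, e.g. pinned */
	}

	/* Success in all cases: the limit has already been set. */
	return 0;
}
```

In this model, a pool with usage 100 and pinned 30 that gets its max
lowered to 10 ends up with usage 30 and still returns 0, which is the
"usage temporarily above max" case discussed for step 3. New charges fail
in the meantime because model_try_charge() already sees the lowered max.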

Thanks,
/Thomas

>
> I also believe it's ok not to continue reclaiming if aborted, the
> caller can always try again if necessary.
>
> If we want to deviate from the memcg controller, we need a very good
> reason to do so. I'd like to keep the semantics the same if possible.
>
> > Will send out when I've updated the IGT tests accordingly.
> >
> > Thanks,
> > Thomas
>
> Kind regards,
> ~Maarten Lankhorst