Re: [PATCH 2/5] cgroup/dmem: Add reclaim callback for lowering max below current usage
From: Maarten Lankhorst
Date: Wed Apr 22 2026 - 05:53:17 EST
Hey,
On 2026-04-22 at 10:42, Thomas Hellström wrote:
> On Wed, 2026-04-22 at 10:31 +0200, Maarten Lankhorst wrote:
>> Hey,
>>
>> (Adding Thadeu to cc since they've been working on the same issue)
>>
>> On 2026-03-27 at 09:15, Thomas Hellström wrote:
>>> Add an optional reclaim callback to struct dmem_cgroup_region. When
>>> dmem.max is set below current usage, invoke the callback to evict
>>> memory and retry setting the limit rather than failing immediately.
>>> Signal interruptions propagate back to the write() caller.
>>>
>>> RFC:
>>> Due to us updating the max limit _after_ the usage has been
>>> sufficiently lowered, this should be prone to failures if there are
>>> aggressive allocators running in parallel to the reclaim.
>>> So can we somehow enforce the new limit while the eviction is
>>> happening?
>>>
>>> Assisted-by: GitHub Copilot:claude-sonnet-4.6
>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
>>> ---
>>> include/linux/cgroup_dmem.h | 11 +++++
>>> kernel/cgroup/dmem.c | 94
>>> +++++++++++++++++++++++++++++++++----
>>> 2 files changed, 96 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/cgroup_dmem.h
>>> b/include/linux/cgroup_dmem.h
>>> index dd4869f1d736..61520a431740 100644
>>> --- a/include/linux/cgroup_dmem.h
>>> +++ b/include/linux/cgroup_dmem.h
>>> @@ -26,6 +26,10 @@ bool dmem_cgroup_state_evict_valuable(struct
>>> dmem_cgroup_pool_state *limit_pool,
>>> bool ignore_low, bool
>>> *ret_hit_low);
>>>
>>> void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state
>>> *pool);
>>> +void dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region
>>> *region,
>>> + int (*reclaim)(struct
>>> dmem_cgroup_pool_state *pool,
>>> + u64
>>> target_bytes, void *priv),
>>> + void *priv);
>>> #else
>>> static inline __printf(2,3) struct dmem_cgroup_region *
>>> dmem_cgroup_register_region(u64 size, const char *name_fmt, ...)
>>> @@ -62,5 +66,12 @@ bool dmem_cgroup_state_evict_valuable(struct
>>> dmem_cgroup_pool_state *limit_pool,
>>> static inline void dmem_cgroup_pool_state_put(struct
>>> dmem_cgroup_pool_state *pool)
>>> { }
>>>
>>> +static inline void
>>> +dmem_cgroup_region_set_reclaim(struct dmem_cgroup_region *region,
>>> + int (*reclaim)(struct
>>> dmem_cgroup_pool_state *pool,
>>> + u64 target_bytes,
>>> void *priv),
>>> + void *priv)
>>> +{ }
>>> +
>>> #endif
>>> #endif /* _CGROUP_DMEM_H */
>>> diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
>>> index 3e6d4c0b26a1..f993fb058b74 100644
>>> --- a/kernel/cgroup/dmem.c
>>> +++ b/kernel/cgroup/dmem.c
>>> @@ -51,6 +51,18 @@ struct dmem_cgroup_region {
>>> * No new pools should be added to the region afterwards.
>>> */
>>> bool unregistered;
>>> +
>>> + /**
>>> + * @reclaim: Optional callback invoked when dmem.max is
>>> set below the
>>> + * current usage of a pool. The driver should attempt to
>>> free at least
>>> + * @target_bytes from @pool. May be called multiple times
>>> if usage
>>> + * remains above the limit after returning.
>>> + */
>>> + int (*reclaim)(struct dmem_cgroup_pool_state *pool, u64
>>> target_bytes,
>>> + void *priv);
>>> +
>>> + /** @reclaim_priv: Private data passed to @reclaim. */
>>> + void *reclaim_priv;
>>> };
>>>
>>> struct dmemcg_state {
>>> @@ -145,23 +157,59 @@ static void free_cg_pool(struct
>>> dmem_cgroup_pool_state *pool)
>>> }
>>>
>>> static int
>>> -set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val)
>>> +set_resource_min(struct dmem_cgroup_pool_state *pool, u64 val,
>>> + struct dmem_cgroup_region *region)
>>> {
>>> page_counter_set_min(&pool->cnt, val);
>>> return 0;
>>> }
>>>
>>> static int
>>> -set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
>>> +set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val,
>>> + struct dmem_cgroup_region *region)
>>> {
>>> page_counter_set_low(&pool->cnt, val);
>>> return 0;
>>> }
>>>
>>> static int
>>> -set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
>>> +set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val,
>>> + struct dmem_cgroup_region *region)
>>> {
>>> - return page_counter_set_max(&pool->cnt, val);
>>> + int err = page_counter_set_max(&pool->cnt, val);
>>> +
>>> + if (err != -EBUSY || !region || !region->reclaim)
>>> + return err;
>>> +
>>> + /*
>>> + * The new max is below current usage. Ask the driver to
>>> evict memory
>>> + * and retry, up to a bounded number of times. Signal
>>> interruptions are
>>> + * propagated back to the write() caller; other reclaim
>>> failures leave
>>> + * -EBUSY as the result.
>>> + */
>>> + for (int retries = 5; retries > 0; retries--) {
>>> + u64 usage = page_counter_read(&pool->cnt);
>>> + u64 target = usage > val ? usage - val : 0;
>>> + int reclaim_err;
>>> +
>>> + if (!target) {
>>> + err = page_counter_set_max(&pool->cnt,
>>> val);
>>> + break;
>>> + }
>>> +
>>> + reclaim_err = region->reclaim(pool, target,
>>> region->reclaim_priv);
>>> + if (reclaim_err) {
>>> + if (reclaim_err == -EINTR || reclaim_err
>>> == -ERESTARTSYS)
>>> + err = reclaim_err;
>>> + break;
>>> + }
>>> +
>>> + err = page_counter_set_max(&pool->cnt, val);
>>> + if (err != -EBUSY)
>>> + break;
>>> + }
>>> +
>>> + return err;
>>> }
>>
>> I mentioned this in chat, but I want to raise it on the mailing list
>> for others as well: can we reproduce the behavior of
>> memory_max_write() in mm/memcontrol.c?
>>
>> 1. First set new limit through xchg.
>> 2. If O_NONBLOCK is set -> do nothing, next allocation in target
>> region will fail and cause reclaim.
>> 3. If not set -> reclaim until below new limit or interrupted by a
>> signal, return success in all cases here since we set new limit.
>>
>>
>
> Yup.
>
> For 3, we also need to consider the case where we fail to reclaim due
> to memory being pinned. If it's OK for current usage to be (usually
> temporarily) above max, that would work.
>
> I have that coded up and have also added a patch on top that defers
> reclaim to a thread if we bail due to a signal or O_NONBLOCK. Perhaps
> we could discuss whether that's a good or bad idea in that patch.

That doesn't sound like a good idea. O_NONBLOCK deliberately exists so
that the max can be changed without triggering reclaim. See commit
("memcg: introduce non-blocking limit setting option") for the details.

I also believe it's OK not to continue reclaiming once aborted; the
caller can always try again if necessary.

If we want to deviate from the memcg controller, we need a very good
reason to do so. I'd like to keep the semantics the same if possible.
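
To make the semantics concrete, here is a rough userspace sketch of the
three steps listed earlier in the thread. It is not kernel code: every
sketch_* name, the "pinned" field and the "signal_pending" flag are
made up for illustration, and C11 atomics stand in for xchg() and
page_counter. The reclaim helper is only a placeholder for whatever the
driver callback would do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a pool; NOT struct dmem_cgroup_pool_state. */
struct sketch_pool {
	_Atomic uint64_t max;    /* current limit in bytes */
	_Atomic uint64_t usage;  /* bytes currently charged */
	uint64_t pinned;         /* bytes reclaim can never evict */
	bool signal_pending;     /* stand-in for signal_pending(current) */
};

/* Hypothetical reclaim: evict up to @target unpinned bytes, return bytes freed. */
static uint64_t sketch_reclaim(struct sketch_pool *pool, uint64_t target)
{
	uint64_t usage = atomic_load(&pool->usage);
	uint64_t evictable = usage > pool->pinned ? usage - pool->pinned : 0;
	uint64_t freed = target < evictable ? target : evictable;

	atomic_fetch_sub(&pool->usage, freed);
	return freed;
}

/* Models a write to the max file; @nonblock stands in for O_NONBLOCK. */
static int sketch_set_max(struct sketch_pool *pool, uint64_t new_max, bool nonblock)
{
	assert(pool);

	/* 1. Publish the new limit first so parallel allocators see it. */
	atomic_exchange(&pool->max, new_max);

	/* 2. O_NONBLOCK: no reclaim here; later allocations hit the limit. */
	if (nonblock)
		return 0;

	/* 3. Reclaim until below the limit, interrupted, or out of progress. */
	for (;;) {
		uint64_t usage = atomic_load(&pool->usage);

		if (usage <= new_max || pool->signal_pending)
			break;
		if (!sketch_reclaim(pool, usage - new_max))
			break; /* e.g. everything left is pinned */
	}

	/* The limit is already in place, so report success in all cases. */
	return 0;
}
```

In this model, a pool with usage above the new max keeps its usage
unchanged after a non-blocking write, while a blocking write brings
usage down to the limit unless the remainder is pinned or a signal is
pending. In both cases the new max is already in effect.
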
> Will send out when I've updated the IGT tests accordingly.
>
> Thanks,
> Thomas
Kind regards,
~Maarten Lankhorst