[PATCH v3] cgroup/dmem: implement dmem.high soft limit via prioritized eviction

From: Qiliang Yuan

Date: Thu May 28 2026 - 08:09:19 EST


The dmem cgroup v2 controller currently only provides a hard "max"
limit, which causes immediate allocation failures when a cgroup's
device memory usage reaches its quota. GPU-bound AI workloads need
smoother over-subscription support: a soft limit that temporarily
allows excess usage while applying backpressure through reclaim
rather than outright failure.

Add dmem.high, a soft limit that penalizes over-limit cgroups by
evicting their buffer objects first when eviction is triggered (e.g.
due to a "max" limit hit). Unlike the rejected v1 approach which
used sleep-on-allocation throttling, this version provides a
meaningful recovery action through prioritized reclaim.

Expose "high" as a new cgroupfs control file per region via
set_resource_high() and get_resource_high(), and initialize it to
PAGE_COUNTER_MAX in reset_all_resource_limits(). Like get_resource_max(),
get_resource_high() returns PAGE_COUNTER_MAX when the pool is NULL.

Extend dmem_cgroup_state_evict_valuable() with a "try_high"
parameter. When set, the function walks the page_counter parent
chain to check whether any ancestor exceeds its high limit, then
verifies that the pool is above its effective minimum to respect
dmem.min protection. Only pools meeting both criteria are evicted.

Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy.
Pass 1 uses trylock and targets only BOs whose cgroup exceeds
dmem.high. Pass 2 falls back to the standard above-elow eviction.
Pass 3 begins with a properly-locked high-priority pass in case
Pass 1 failed due to trylock contention, then proceeds with the
standard repeat-while-making-progress loop with low-watermark
fallback.

Signed-off-by: Qiliang Yuan <realwujing@xxxxxxxxx>
---
Introduce a "high" soft limit for the dmem cgroup v2 controller.
When a "max" limit is hit and eviction is triggered, buffer objects
belonging to cgroups that exceed their dmem.high limit are targeted
first, providing a meaningful recovery action through reclaim.

The dmem cgroup currently only supports hard "max" limits, which
cause immediate allocation failures for GPU-bound workloads. A soft
limit enables smoother over-subscription by penalizing over-limit
cgroups via prioritized eviction rather than outright rejection.

The implementation adds a "high" cgroupfs control file per region,
a try_high parameter to dmem_cgroup_state_evict_valuable() for
tier-1 eviction, and a 3-pass strategy in ttm_bo_evict_alloc().
---
V2 -> V3:
- Walk the page_counter parent chain in the try_high pass to prevent
child cgroups from evading the penalty when a parent cgroup exceeds
its dmem.high limit.
- Check dmem.min protection in the try_high pass to avoid evicting
BOs below the effective minimum.
- Add a properly-locked high-priority retry at the beginning of Pass 3
so that actively-used over-limit BOs (which failed trylock in Pass 1)
are not skipped while innocent cgroups are evicted.
- Fix get_resource_high(NULL) returning 0 instead of PAGE_COUNTER_MAX
to match the behavior of get_resource_max().

V1 -> V2:
- Replace sleep-on-allocation throttling with prioritized eviction.
When a "max" limit is hit, BOs from cgroups exceeding dmem.high are
evicted first in a dedicated pass. No throttling or sleeping is
performed in the charge path.
- Remove task throttling (schedule_timeout_killable, TIF_NOTIFY_RESUME,
resume_user_mode_work() integration) entirely.
- Add dmem.high cgroupfs control file per region.
- Extend dmem_cgroup_state_evict_valuable() with try_high parameter
to target over-limit cgroups as tier-1 eviction.
- Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy:
(1) trylock: evict only BOs exceeding dmem.high
(2) trylock: above-elow
(3) proper-lock: repeat with low fallback.
- Initialize high to PAGE_COUNTER_MAX in reset_all_resource_limits().

v1: https://lore.kernel.org/all/20260520-feature-dmem-high-v1-1-97ca0cb7f95a@xxxxxxxxx
v2: https://lore.kernel.org/all/20260522-feature-dmem-high-v2-1-d805deddecbb@xxxxxxxxx
---
drivers/gpu/drm/ttm/ttm_bo.c | 35 ++++++++++++++++++++----
include/linux/cgroup_dmem.h | 4 +--
kernel/cgroup/dmem.c | 65 ++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bcd76f6bb7f02..2f2b428f1d30a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -505,6 +505,8 @@ struct ttm_bo_evict_walk {

/** @limit_pool: Which pool limit we should test against */
struct dmem_cgroup_pool_state *limit_pool;
+ /** @try_high: Whether to only evict BO's above the high watermark (first pass) */
+ bool try_high;
/** @try_low: Whether we should attempt to evict BO's with low watermark threshold */
bool try_low;
/** @hit_low: If we cannot evict a bo when @try_low is false (first pass) */
@@ -518,7 +520,8 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
s64 lret;

if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
- evict_walk->try_low, &evict_walk->hit_low))
+ evict_walk->try_high, evict_walk->try_low,
+ &evict_walk->hit_low))
return 0;

if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
@@ -577,31 +580,51 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
};
s64 lret;

+ /*
+ * Pass 1 (trylock): Only evict BOs whose cgroup is above its
+ * dmem.high soft limit. This penalizes over-limit cgroups first.
+ */
evict_walk.walk.arg.trylock_only = true;
+ evict_walk.try_high = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+ evict_walk.try_high = false;
+ if (lret)
+ goto out;

- /* One more attempt if we hit low limit? */
+ /*
+ * Pass 2 (trylock): Evict BOs above the effective low watermark.
+ * Falls back to low-priority eviction if needed.
+ */
+ lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
if (!lret && evict_walk.hit_low) {
evict_walk.try_low = true;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
}
+
if (lret || !ticket)
goto out;

- /* Reset low limit */
+ /*
+ * Pass 3+ (properly locked): Evict while making progress.
+ * First retry the high-priority pass with proper locking in case
+ * Pass 1 failed due to trylock contention on over-limit BOs.
+ * If that still fails, fall back to the standard low-priority eviction.
+ */
evict_walk.try_low = evict_walk.hit_low = false;
- /* If ticket-locking, repeat while making progress. */
evict_walk.walk.arg.trylock_only = false;
+ evict_walk.try_high = true;
+ lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+ evict_walk.try_high = false;
+ if (lret)
+ goto out;

retry:
do {
- /* The walk may clear the evict_walk.walk.ticket field */
evict_walk.walk.arg.ticket = ticket;
evict_walk.evicted = 0;
lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
} while (!lret && evict_walk.evicted);

- /* We hit the low limit? Try once more */
if (!lret && evict_walk.hit_low && !evict_walk.try_low) {
evict_walk.try_low = true;
goto retry;
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..06115d35509b1 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -23,7 +23,7 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
- bool ignore_low, bool *ret_hit_low);
+ bool try_high, bool ignore_low, bool *ret_hit_low);

void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
#else
@@ -54,7 +54,7 @@ static inline void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64
static inline
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
- bool ignore_low, bool *ret_hit_low)
+ bool try_high, bool ignore_low, bool *ret_hit_low)
{
return true;
}
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f2..c80444c0da177 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -156,6 +156,12 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
page_counter_set_low(&pool->cnt, val);
}

+static void
+set_resource_high(struct dmem_cgroup_pool_state *pool, u64 val)
+{
+ page_counter_set_high(&pool->cnt, val);
+}
+
static void
set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
{
@@ -167,6 +173,11 @@ static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
return pool ? READ_ONCE(pool->cnt.low) : 0;
}

+static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
+{
+ return pool ? READ_ONCE(pool->cnt.high) : PAGE_COUNTER_MAX;
+}
+
static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
{
return pool ? READ_ONCE(pool->cnt.min) : 0;
@@ -186,6 +197,7 @@ static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
{
set_resource_min(rpool, 0);
set_resource_low(rpool, 0);
+ set_resource_high(rpool, PAGE_COUNTER_MAX);
set_resource_max(rpool, PAGE_COUNTER_MAX);
}

@@ -289,10 +301,13 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
* dmem_cgroup_state_evict_valuable() - Check if we should evict from test_pool
* @limit_pool: The pool for which we hit limits
* @test_pool: The pool for which to test
+ * @try_high: Only evict BOs whose usage exceeds the high limit (first pass)
* @ignore_low: Whether we have to respect low watermarks.
* @ret_hit_low: Pointer to whether it makes sense to consider low watermark.
*
* This function returns true if we can evict from @test_pool, false if not.
+ * When @try_high is set, only pools with usage above their high limit are
+ * evictable, enabling prioritized eviction of over-limit cgroups.
* When returning false and @ignore_low is false, @ret_hit_low may
* be set to true to indicate this function can be retried with @ignore_low
* set to true.
@@ -301,7 +316,7 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
*/
bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
struct dmem_cgroup_pool_state *test_pool,
- bool ignore_low, bool *ret_hit_low)
+ bool try_high, bool ignore_low, bool *ret_hit_low)
{
struct dmem_cgroup_pool_state *pool = test_pool;
struct page_counter *ctest;
@@ -331,9 +346,38 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,

ctest = &test_pool->cnt;

+ used = page_counter_read(ctest);
+
+ if (try_high) {
+ struct page_counter *c;
+
+ /*
+ * Walk the page_counter parent chain to check whether any
+ * ancestor cgroup exceeds its dmem.high limit. This prevents
+ * child cgroups from evading the penalty when a parent cgroup
+ * is over its high limit.
+ */
+ if (used <= READ_ONCE(ctest->high)) {
+ for (c = ctest->parent; c; c = c->parent) {
+ if (page_counter_read(c) > READ_ONCE(c->high))
+ break;
+ }
+ if (!c)
+ return false;
+ }
+
+ /*
+ * Respect dmem.min protection: do not evict BOs below the
+ * effective minimum even during the high-priority pass.
+ */
+ dmem_cgroup_calculate_protection(limit_pool, test_pool);
+ min = READ_ONCE(ctest->emin);
+
+ return used > min;
+ }
+
dmem_cgroup_calculate_protection(limit_pool, test_pool);

- used = page_counter_read(ctest);
min = READ_ONCE(ctest->emin);

if (used <= min)
@@ -835,6 +879,17 @@ static ssize_t dmem_cgroup_region_low_write(struct kernfs_open_file *of,
return dmemcg_limit_write(of, buf, nbytes, off, set_resource_low);
}

+static int dmem_cgroup_region_high_show(struct seq_file *sf, void *v)
+{
+ return dmemcg_limit_show(sf, v, get_resource_high);
+}
+
+static ssize_t dmem_cgroup_region_high_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ return dmemcg_limit_write(of, buf, nbytes, off, set_resource_high);
+}
+
static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
{
return dmemcg_limit_show(sf, v, get_resource_max);
@@ -868,6 +923,12 @@ static struct cftype files[] = {
.seq_show = dmem_cgroup_region_low_show,
.flags = CFTYPE_NOT_ON_ROOT,
},
+ {
+ .name = "high",
+ .write = dmem_cgroup_region_high_write,
+ .seq_show = dmem_cgroup_region_high_show,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
{
.name = "max",
.write = dmem_cgroup_region_max_write,

---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260519-feature-dmem-high-16997148dc38

Best regards,
--
Qiliang Yuan <realwujing@xxxxxxxxx>