Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap

From: Rik van Riel
Date: Thu Dec 12 2024 - 15:01:02 EST


On Thu, 12 Dec 2024 18:31:57 +0000
Roman Gushchin <roman.gushchin@xxxxxxxxx> wrote:

> Is it about a single task or groups of tasks or the entire cgroup?
> If the former, why is it a problem? A tight memcg limit can slow things down
> in general, and I don't see why we should treat the exit() path differently.
>
I think the exit path does need to be treated a little differently,
since this exit may be the only way such a cgroup can free up memory.

> If it's about the entire cgroup and we have essentially a deadlock,
> I feel like we need to look into the oom reaper side.

You mean something like the below?

I have not tested it yet, because we don't have any stuck
cgroups right now among the workloads that I'm monitoring.
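
In short, the patch below makes the charge slow path treat tasks with
PF_EXITING specially. A rough sketch of the idea (the helper name here is
made up for illustration; the real change is in try_charge_memcg() in the
diff below):

/* Sketch only, not the actual code. */
static int memcg_charge_exit_path_sketch(gfp_t gfp_mask)
{
	/* An exiting task should not retry memcg reclaim forever. */
	if (unlikely(current->flags & PF_EXITING))
		gfp_mask |= __GFP_NORETRY;

	/* ... a single, bounded round of memcg reclaim runs here ... */

	/*
	 * If reclaim still could not make room, ask the OOM reaper to
	 * free the exiting task's address space asynchronously.
	 */
	if (unlikely(current->flags & PF_EXITING))
		queue_oom_reaper(current);

	return -ENOMEM;
}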

---8<---

From c0e545fd45bd3ee24cd79b3d3e9b375e968ef460 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@xxxxxxxxxxx>
Date: Thu, 12 Dec 2024 14:50:49 -0500
Subject: [PATCH] memcg,oom: speed up reclaim for exiting tasks

When a memcg reaches its memory limit and reclaim becomes unavailable
or slow for some reason (for example, only zswap is available but zswap
writeback is disabled), it can take a long time for tasks to exit and
for the cgroup to get back to normal (or be cleaned up).

Speed up memcg reclaim for exiting tasks by limiting how much work
reclaim does, and by invoking the OOM reaper if reclaim does not
free up enough memory to allow the task to make progress.

Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
---
 include/linux/oom.h |  8 ++++++++
 mm/memcontrol.c     | 11 +++++++++++
 mm/oom_kill.c       |  6 +-----
 3 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 1e0fc6931ce9..b2d9cf936664 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -111,4 +111,12 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+#ifdef CONFIG_MMU
+extern void queue_oom_reaper(struct task_struct *tsk);
+#else
+static inline void queue_oom_reaper(struct task_struct *tsk)
+{
+}
+#endif
+
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..21f42758d430 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2231,6 +2231,9 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!gfpflags_allow_blocking(gfp_mask))
 		goto nomem;
 
+	if (unlikely(current->flags & PF_EXITING))
+		gfp_mask |= __GFP_NORETRY;
+
 	memcg_memory_event(mem_over_limit, MEMCG_MAX);
 	raised_max_event = true;

@@ -2284,6 +2287,14 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		goto retry;
 	}
 nomem:
+	/*
+	 * We ran out of memory while inside exit. Maybe the OOM
+	 * reaper can help reduce cgroup memory use and get things
+	 * moving again?
+	 */
+	if (unlikely(current->flags & PF_EXITING))
+		queue_oom_reaper(current);
+
 	/*
 	 * Memcg doesn't have a dedicated reserve for atomic
 	 * allocations. But like the global atomic pool, we need to
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1c485beb0b93..8d5278e45c63 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -686,7 +686,7 @@ static void wake_oom_reaper(struct timer_list *timer)
  * before the exit path is able to wake the futex waiters.
  */
 #define OOM_REAPER_DELAY (2*HZ)
-static void queue_oom_reaper(struct task_struct *tsk)
+void queue_oom_reaper(struct task_struct *tsk)
 {
 	/* mm is already queued? */
 	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
@@ -735,10 +735,6 @@ static int __init oom_init(void)
 	return 0;
 }
 subsys_initcall(oom_init)
-#else
-static inline void queue_oom_reaper(struct task_struct *tsk)
-{
-}
 #endif /* CONFIG_MMU */
 
 /**
--
2.47.0