Re: [PATCH] memcg: allow exiting tasks to write back data to swap

From: Yosry Ahmed
Date: Wed Dec 11 2024 - 12:01:28 EST


On Wed, Dec 11, 2024 at 8:34 AM Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> On Wed, 2024-12-11 at 08:26 -0800, Yosry Ahmed wrote:
> > On Wed, Dec 11, 2024 at 7:54 AM Rik van Riel <riel@xxxxxxxxxxx>
> > wrote:
> > >
> > > +++ b/mm/memcontrol.c
> > > @@ -5371,6 +5371,15 @@ bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > >  	if (!zswap_is_enabled())
> > >  		return true;
> > >
> > > +	/*
> > > +	 * Always allow exiting tasks to push data to swap. A process in
> > > +	 * the middle of exit cannot get OOM killed, but may need to push
> > > +	 * uncompressible data to swap in order to get the cgroup memory
> > > +	 * use below the limit, and make progress with the exit.
> > > +	 */
> > > +	if ((current->flags & PF_EXITING) && memcg == mem_cgroup_from_task(current))
> > > +		return true;
> > > +
> >
> >
> > I have a few questions:
> > (a) If the task is being OOM killed it should be able to charge
> > memory
> > beyond memory.max, so why do we need to get the usage down below the
> > limit?
> >
> If it is a kernel directed memcg OOM kill, that is
> true.
>
> However, if the exit comes from somewhere else,
> like a userspace oomd kill, we might not hit that
> code path.

Why do we treat dying tasks differently based on the source of the kill?
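
For my own understanding, the mechanical difference between the two
cases seems to be only how the task is marked; roughly (illustrative
only, not a patch):

	/*
	 * A kernel memcg OOM kill goes through mark_oom_victim(), so
	 * tsk_is_oom_victim() is true and the charge path can treat the
	 * task specially.  A userspace oomd kill just delivers SIGKILL,
	 * so all we see is fatal_signal_pending() and, once the task has
	 * reached do_exit(), PF_EXITING.
	 */
	bool kernel_oom_kill = tsk_is_oom_victim(current);
	bool userspace_kill  = fatal_signal_pending(current) ||
			       (current->flags & PF_EXITING);

Either way the task is dying, which is why treating them differently
seems surprising to me.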

>
> > Looking at the other thread with Michal, it looks like it's because
> > we
> > have to go into reclaim first before we get to the point of force
> > charging for dying tasks, and we spend too much time in reclaim. Is
> > that correct?
> >
> > If that's the case, I am wondering if the real problem is that we
> > check mem_cgroup_zswap_writeback_enabled() too late in the process.
> > Reclaim ages the LRUs, isolates pages, unmaps them, allocates swap
> > entries, only to realize it cannot swap in swap_writepage().
> >
> > Should we check for this in can_reclaim_anon_pages()? If zswap
> > writeback is disabled and we are already at the memcg limit (or zswap
> > limit for that matter), we should avoid scanning anon memory to begin
> > with. The problem is that if we race with memory being freed we may
> > have some extra OOM kills, but I am not sure how common this case
> > would be.
>
> However, we don't know until the attempted zswap write
> whether the memory is compressible, and whether doing
> a bunch of zswap writes will help us bring our memcg
> down below its memory.max limit.

If we are at memory.max (or memory.zswap.max), we can't compress pages
into zswap anyway, regardless of their compressibility.

So what I am saying is, if we are already at the limit (pages cannot
go into zswap), and writeback is disabled (pages cannot go into
swapfiles), then we should probably avoid scanning the anon LRUs and
spending all those wasted cycles trying to isolate, unmap, and reclaim
them only to fail at the last step.
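
Roughly what I have in mind, as an untested sketch against
can_reclaim_anon_pages() in mm/vmscan.c (memcg_zswap_has_room() is a
made-up name for whatever limit check we would actually use there):

	static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
						  int nid, struct scan_control *sc)
	{
		/*
		 * Untested sketch: if this memcg's anon pages can neither
		 * go into zswap (limit already hit) nor be written back to
		 * a swapfile (writeback disabled), scanning the anon LRUs
		 * cannot make progress, so only demotion is left.
		 */
		if (memcg && !mem_cgroup_zswap_writeback_enabled(memcg) &&
		    !memcg_zswap_has_room(memcg))	/* made-up helper */
			return can_demote(nid, sc);

		/* ... existing swap-space / memcg swap limit checks ... */
	}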

>
> >
> > (b) Should we use mem_cgroup_is_descendant() or mm_match_cgroup() in
> > case we are reclaiming from an ancestor and we hit the limit of that
> > ancestor?
> >
> I don't know if we need or want to reclaim from any
> other memcgs than those of the exiting process itself.
>
> A small blast radius seems like it could be desirable,
> but I'm open to other ideas :)

The exiting process is part of all of its ancestor cgroups by virtue of
the hierarchy.

If we have the following hierarchy:
root
|
A
|
B

Then a process in cgroup B could be getting OOM killed due to hitting
the limit of A, not B. In that case, reclaiming from A is what helps us
get below the limit. We could check whether the cgroup is an ancestor
that hit its limit, but maybe that's overkill.
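
Something like this, as an untested variant of the hunk above (with the
RCU handling from point (c) below folded in):

	if (current->flags & PF_EXITING) {
		struct mem_cgroup *task_memcg;
		bool match;

		/*
		 * Allow the exiting task to write back not only when we
		 * are reclaiming its own memcg, but also when we are
		 * reclaiming on behalf of an ancestor (e.g. hitting A's
		 * limit from a task in B).
		 */
		rcu_read_lock();
		task_memcg = mem_cgroup_from_task(current);
		match = mem_cgroup_is_descendant(task_memcg, memcg);
		rcu_read_unlock();
		if (match)
			return true;
	}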

>
> > (c) mem_cgroup_from_task() should be called in an RCU read section
> > (or
> > we need something like rcu_access_pointer() if we are not dereferencing
> > the pointer).
> >
> I'll add this in v2.
>
> --
> All Rights Reversed.