Re: [PATCH 0/2] hugetlb memcg accounting

From: Michal Hocko
Date: Mon Oct 02 2023 - 10:58:28 EST


On Mon 02-10-23 10:42:50, Johannes Weiner wrote:
> On Sun, Oct 01, 2023 at 04:27:30PM -0700, Mike Kravetz wrote:
> > On 09/27/23 14:47, Johannes Weiner wrote:
> > > On Wed, Sep 27, 2023 at 01:21:20PM +0200, Michal Hocko wrote:
> > > > On Tue 26-09-23 12:49:47, Nhat Pham wrote:
> > >
> > > So that if you use 80% hugetlb, the other memory is forced to stay in
> > > the remaining 20%, or it OOMs; and that if you don't use hugetlb, the
> > > group is still allowed to use the full 100% of its host memory
> > > allowance, without requiring some outside agent continuously
> > > monitoring and adjusting the container limits.
> >
> > Jumping in late here as I was traveling last week. In addition, I want
> > to state my limited cgroup knowledge up front.
> >
> > I was thinking of your scenario above a little differently. Suppose a
> > group is up and running at almost 100% memory usage. However, the majority
> > of that memory is reclaimable. Now, someone wants to allocate a 2M hugetlb
> > page. There is not 2MB free, but we could easily reclaim 2MB to make room
> > for the hugetlb page. I may be missing something, but I do not see how that
> > is going to happen. It seems like we would really want that behavior.
>
> But that is actually what it does, no?
>
> alloc_hugetlb_folio
>   mem_cgroup_hugetlb_charge_folio
>     charge_memcg
>       try_charge
>         !page_counter_try_charge ?
>           !try_to_free_mem_cgroup_pages ?
>             mem_cgroup_oom
>
> So it does reclaim when the hugetlb hits the cgroup limit. And if that
> fails to make room, it OOMs the cgroup.
>
> Or maybe I'm missing something?

I believe that Mike alludes to what I have pointed out in another email:
http://lkml.kernel.org/r/ZRrI90KcRBwVZn/r@xxxxxxxxxxxxxx and a situation
where the hugetlb request results in an actual hugetlb allocation rather
than consumption from the pre-allocated pool. In that case memcg is not
involved because the charge happens only after the allocation happens.
That btw. means that this request could disrupt a different memcg even
if the current one is at the limit or could be reclaimed instead.

Also there is no OOM, as hugetlb pages are costly requests and we do not
invoke the oom killer for those.
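
To make the ordering concern concrete, here is a toy userspace sketch. It
is not kernel code: every toy_* name is hypothetical, page counts are
plain integers, and the "victim" of global reclaim is passed in by hand,
whereas in the kernel reclaim picks its own targets. It only models the
point above, that a fresh hugetlb allocation goes to the global allocator
first and the memcg charge is attempted afterwards.

```c
#include <stdbool.h>

/* Toy model only: none of these types or functions are kernel APIs. */
struct toy_memcg {
	long usage;       /* pages currently charged */
	long limit;       /* hard limit in pages */
	long reclaimable; /* charged pages that reclaim could free */
};

static long toy_free_pages; /* free pages in the global allocator */

/* Charge nr pages; reclaim from this cgroup if that exceeds its limit. */
static bool toy_try_charge(struct toy_memcg *cg, long nr)
{
	long over = cg->usage + nr - cg->limit;

	if (over > 0) {
		if (over > cg->reclaimable)
			return false; /* reclaim cannot make room */
		cg->reclaimable -= over;
		cg->usage -= over;
		toy_free_pages += over;
	}
	cg->usage += nr;
	return true;
}

/*
 * Global allocation: memcg limits are not consulted here. If free memory
 * is short, pages are reclaimed from the victim, which need not be the
 * cgroup that asked for the hugetlb page.
 */
static bool toy_global_alloc(struct toy_memcg *victim, long nr)
{
	long shortfall = nr - toy_free_pages;

	if (shortfall > 0) {
		if (shortfall > victim->reclaimable)
			return false;
		victim->reclaimable -= shortfall;
		victim->usage -= shortfall;
		toy_free_pages += shortfall;
	}
	toy_free_pages -= nr;
	return true;
}

/* Pool is empty: allocate first, charge second. */
static bool toy_hugetlb_alloc_fresh(struct toy_memcg *cg,
				    struct toy_memcg *victim, long nr)
{
	if (!toy_global_alloc(victim, nr))
		return false;
	if (!toy_try_charge(cg, nr)) {
		toy_free_pages += nr; /* give the pages back */
		return false;
	}
	return true;
}
```

In this model, with the requesting group at its limit (usage 1000, limit
1000, 800 reclaimable), another group holding 600 reclaimable pages, and
no free memory, a 512 page request first takes 512 pages from the other
group and only then reclaims 512 pages from the requester at charge time.
The requester alone could have covered the request.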

--
Michal Hocko
SUSE Labs