Re: [PATCH v4] hugetlb: Add hugetlb.*.numa_stat file
From: Mina Almasry
Date: Wed Nov 10 2021 - 17:59:40 EST
On Tue, Nov 9, 2021 at 10:15 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> On Mon, Nov 8, 2021 at 1:05 PM Mina Almasry <almasrymina@xxxxxxxxxx> wrote:
> >
> > For hugetlb backed jobs/VMs it's critical to understand the numa
> > information for the memory backing these jobs to deliver optimal
> > performance.
> >
> > Currently this technically can be queried from /proc/self/numa_maps, but
> > there are significant issues with that. Namely:
> > 1. Memory can be mapped on unmapped.
>
> or*
>
> > 2. numa_maps are per process and need to be aggregated across all
> > processes in the cgroup. For shared memory this is more involved as
> > the userspace needs to make sure it doesn't double count shared
> > mappings.
> > 3. I believe querying numa_maps needs to hold the mmap_lock which adds
> > to the contention on this lock.
> >
> > For these reasons I propose simply adding hugetlb.*.numa_stat file,
> > which shows the numa information of the cgroup similarly to
> > memory.numa_stat.
> >
> > On cgroup-v2:
> > cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat
> > total=2097152 N0=2097152 N1=0
> >
> > On cgroup-v1:
> > cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat
> > total=2097152 N0=2097152 N1=0
> > hierarichal_total=2097152 N0=2097152 N1=0
> >
> > This patch was tested manually by allocating hugetlb memory and querying
> > the hugetlb.*.numa_stat file of the cgroup and its parents.
> > 
> > Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > Cc: Shuah Khan <shuah@xxxxxxxxxx>
> > Cc: Miaohe Lin <linmiaohe@xxxxxxxxxx>
> > Cc: Oscar Salvador <osalvador@xxxxxxx>
> > Cc: Michal Hocko <mhocko@xxxxxxxx>
> > Cc: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> > Cc: David Rientjes <rientjes@xxxxxxxxxx>
> > Cc: Shakeel Butt <shakeelb@xxxxxxxxxx>
> > Cc: Jue Wang <juew@xxxxxxxxxx>
> > Cc: Yang Yao <ygyao@xxxxxxxxxx>
> > Cc: Joanna Li <joannali@xxxxxxxxxx>
> > Cc: Cannon Matthews <cannonmatthews@xxxxxxxxxx>
> > Cc: linux-mm@xxxxxxxxx
> > Cc: linux-kernel@xxxxxxxxxxxxxxx
> >
> > Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx>
> >
> [...]
>
> > diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> > index c137396129db..54ff6ec68ed3 100644
> > --- a/include/linux/hugetlb_cgroup.h
> > +++ b/include/linux/hugetlb_cgroup.h
> > @@ -36,6 +36,11 @@ enum hugetlb_memory_event {
> > HUGETLB_NR_MEMORY_EVENTS,
> > };
> >
> > +struct hugetlb_cgroup_per_node {
> > + /* hugetlb usage in bytes over all hstates. */
> > + unsigned long usage[HUGE_MAX_HSTATE];
>
> Should we use atomic here? Is this safe for all archs?
>
> [...]
>
> >
> > /*
> > @@ -292,7 +321,8 @@ static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> > return;
> >
> > __set_hugetlb_cgroup(page, h_cg, rsvd);
> > - return;
> > + if (!rsvd && h_cg)
>
> No need to check h_cg.
>
> > + h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages;
> > }
> >
> > void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> > @@ -331,7 +361,8 @@ static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> >
> > if (rsvd)
> > css_put(&h_cg->css);
> > -
> > + else
> > + h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages;
> > return;
> > }
> >
> > @@ -421,6 +452,58 @@ enum {
> > RES_RSVD_FAILCNT,
> > };
> >
> > +static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
> > +{
> > + int nid;
> > + struct cftype *cft = seq_cft(seq);
> > + int idx = MEMFILE_IDX(cft->private);
> > + bool legacy = MEMFILE_ATTR(cft->private);
> > + struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
> > + struct cgroup_subsys_state *css;
> > + unsigned long usage;
> > +
> > + if (legacy) {
> > + /* Add up usage across all nodes for the non-hierarchical total. */
> > + usage = 0;
> > + for_each_node_state(nid, N_MEMORY)
> > + usage += h_cg->nodeinfo[nid]->usage[idx];
> > + seq_printf(seq, "total=%lu", usage * PAGE_SIZE);
> > +
> > + /* Simply print the per-node usage for the non-hierarchical total. */
> > + for_each_node_state(nid, N_MEMORY)
> > + seq_printf(seq, " N%d=%lu", nid,
> > + h_cg->nodeinfo[nid]->usage[idx] * PAGE_SIZE);
> > + seq_putc(seq, '\n');
> > + }
> > +
> > + /*
> > + * The hierarchical total is pretty much the value recorded by the
> > + * counter, so use that.
> > + */
> > + seq_printf(seq, "%stotal=%lu", legacy ? "hierarichal_" : "",
> > + page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
> > +
> > + /*
> > + * For each node, transverse the css tree to obtain the hierarichal
> > + * node usage.
> > + */
> > + for_each_node_state(nid, N_MEMORY) {
> > + usage = 0;
> > + rcu_read_lock();
> > + css_for_each_descendant_pre(css, &h_cg->css) {
>
> This will be slow. Do you really want to traverse the hugetlb-cgroup
> tree that many times? We had similar issues with memcg stats as well
> but got resolved with rstat.
>
> Though I feel like hugetlb-cgroup is not that worried about
> performance. There is no batching in the charging and uncharging
> codepaths.
>
I'm not seeing a way to get this info without paying the cost to
transverse the tree somewhere (either at charge time or at collection
time). I think maybe we can wait and see if there are issues with the
performance and if so we can adopt the memcg solutions or something
else.
Thank you for reviewing. I'm about to upload v5 with fixes.
> > + usage += hugetlb_cgroup_from_css(css)
> > + ->nodeinfo[nid]
> > + ->usage[idx];
> > + }
> > + rcu_read_unlock();
> > + seq_printf(seq, " N%d=%lu", nid, usage * PAGE_SIZE);
> > + }
> > +
> > + seq_putc(seq, '\n');
> > +
> > + return 0;
> > +}