Re: [PATCH v1 1/4] mm: memcontrol: use helpers to access page's memcg data

From: Roman Gushchin
Date: Thu Sep 24 2020 - 16:27:22 EST


On Thu, Sep 24, 2020 at 03:45:08PM -0400, Johannes Weiner wrote:
> On Tue, Sep 22, 2020 at 01:36:57PM -0700, Roman Gushchin wrote:
> > Currently there are many open-coded reads and writes of the
> > page->mem_cgroup pointer, as well as a couple of read helpers,
> > which are barely used.
> >
> > This is an obstacle to reusing some bits of the pointer
> > for storing additional bits of information. In fact, we already do
> > this for slab pages, where the last bit indicates that a pointer has
> > an attached vector of objcg pointers instead of a regular memcg
> > pointer.
> >
> > This commit introduces 4 new helper functions and converts all
> > raw accesses to page->mem_cgroup to calls of these helpers:
> > struct mem_cgroup *page_mem_cgroup(struct page *page);
> > struct mem_cgroup *page_mem_cgroup_check(struct page *page);
> > void set_page_mem_cgroup(struct page *page, struct mem_cgroup *memcg);
> > void clear_page_mem_cgroup(struct page *page);
>
> Sounds reasonable to me!
>
> > page_mem_cgroup_check() is intended to be used in cases when the page
> > can be a slab page and have a memcg pointer pointing at objcg vector.
> > It does check the lowest bit, and if set, returns NULL.
> > page_mem_cgroup() contains a VM_BUG_ON_PAGE() check for the page not
> > being a slab page. So do set_page_mem_cgroup() and clear_page_mem_cgroup().
> >
> > To make sure nobody uses a direct access, struct page's
> > mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
> > Only new helpers and a couple of slab-accounting related functions
> > access this field directly.
> >
> > page_memcg() and page_memcg_rcu() helpers defined in mm.h are removed.
> > New page_mem_cgroup() is a direct analog of page_memcg(), while
> > page_memcg_rcu() has a single call site in a small rcu-read-lock
> > section, so it's just not worth it to have a separate helper. So
> > it's replaced with page_mem_cgroup() too.
>
> page_memcg_rcu() does READ_ONCE(). We need to keep that for lockless
> accesses.

OK, how about page_memcg() and page_objcgs(), both of which always use
READ_ONCE()? Since page_memcg_rcu() has only a single call site, I'd prefer
to have one helper instead of two.
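
As a rough user-space sketch of what such a pair of helpers might look
like: READ_ONCE() below is a simplified stand-in for the kernel macro, the
struct definitions are placeholders, MEMCG_DATA_OBJCGS and page_set_objcgs()
are hypothetical names, and returning NULL from page_memcg() for tagged
values is an assumption rather than something settled in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* User-space placeholders, not the real kernel definitions. */
struct mem_cgroup { int id; };
struct obj_cgroup { int id; };
struct page { unsigned long memcg_data; };

/* Simplified model of the kernel's READ_ONCE(): force a single load. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical name for the low bit that marks an objcg vector. */
#define MEMCG_DATA_OBJCGS 0x1UL

/* Returns the memcg pointer, or NULL if the field holds an objcg vector. */
static inline struct mem_cgroup *page_memcg(struct page *page)
{
	/* One load only, so the tag check and the return see the same value. */
	unsigned long memcg_data = READ_ONCE(page->memcg_data);

	if (memcg_data & MEMCG_DATA_OBJCGS)
		return NULL;
	return (struct mem_cgroup *)memcg_data;
}

/* Returns the objcg vector, or NULL if the field holds a plain memcg. */
static inline struct obj_cgroup **page_objcgs(struct page *page)
{
	unsigned long memcg_data = READ_ONCE(page->memcg_data);

	if (!(memcg_data & MEMCG_DATA_OBJCGS))
		return NULL;
	return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_OBJCGS);
}

/* Hypothetical setter, only here so the sketch can be exercised. */
static inline void page_set_objcgs(struct page *page, struct obj_cgroup **vec)
{
	/* The tagging scheme relies on the vector being at least 2-byte aligned. */
	assert(!((unsigned long)vec & MEMCG_DATA_OBJCGS));
	page->memcg_data = (unsigned long)vec | MEMCG_DATA_OBJCGS;
}
```

Because each helper copies the field into a local exactly once, the compiler
cannot re-read page->memcg_data between the bit test and the return, which is
the failure mode raised for the open-coded version.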

>
> > @@ -343,6 +343,72 @@ struct mem_cgroup {
> >
> > extern struct mem_cgroup *root_mem_cgroup;
> >
> > +/*
> > + * page_mem_cgroup - get the memory cgroup associated with a page
> > + * @page: a pointer to the page struct
> > + *
> > + * Returns a pointer to the memory cgroup associated with the page,
> > + * or NULL. This function assumes that the page is known to have a
> > + * proper memory cgroup pointer. It's not safe to call this function
> > + * against some type of pages, e.g. slab pages or ex-slab pages.
> > + */
> > +static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
> > +{
> > + VM_BUG_ON_PAGE(PageSlab(page), page);
> > + return (struct mem_cgroup *)page->memcg_data;
> > +}
>
> This would also be a good place to mention what's required for the
> function to be called safely, or in a way that produces a stable
> result - i.e. the list of conditions in commit_charge().

Makes sense.

>
> > + * page_mem_cgroup_check - get the memory cgroup associated with a page
> > + * @page: a pointer to the page struct
> > + *
> > + * Returns a pointer to the memory cgroup associated with the page,
> > + * or NULL. This function unlike page_mem_cgroup() can take any page
> > + * as an argument. It has to be used in cases when it's not known if a page
> > + * has an associated memory cgroup pointer or an object cgroups vector.
> > + */
> > +static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
> > +{
> > + unsigned long memcg_data = page->memcg_data;
> > +
> > + /*
> > + * The lowest bit set means that memcg isn't a valid
> > + * memcg pointer, but an obj_cgroups pointer.
> > + * In this case the page is shared and doesn't belong
> > + * to any specific memory cgroup.
> > + */
> > + if (memcg_data & 0x1UL)
> > + return NULL;
> > +
> > + return (struct mem_cgroup *)memcg_data;
> > +}
>
> Here as well.
>
> > +
> > +/*
> > + * set_page_mem_cgroup - associate a page with a memory cgroup
> > + * @page: a pointer to the page struct
> > + * @memcg: a pointer to the memory cgroup
> > + *
> > + * Associates a page with a memory cgroup.
> > + */
> > +static inline void set_page_mem_cgroup(struct page *page,
> > + struct mem_cgroup *memcg)
> > +{
> > + VM_BUG_ON_PAGE(PageSlab(page), page);
> > + page->memcg_data = (unsigned long)memcg;
> > +}
> > +
> > +/*
> > + * clear_page_mem_cgroup - clear an association of a page with a memory cgroup
> > + * @page: a pointer to the page struct
> > + *
> > + * Clears an association of a page with a memory cgroup.
> > + */
> > +static inline void clear_page_mem_cgroup(struct page *page)
> > +{
> > + VM_BUG_ON_PAGE(PageSlab(page), page);
> > + page->memcg_data = 0;
> > +}
> > +
> > static __always_inline bool memcg_stat_item_in_bytes(int idx)
> > {
> > if (idx == MEMCG_PERCPU_B)
> > @@ -743,15 +809,15 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
> > static inline void __mod_memcg_page_state(struct page *page,
> > int idx, int val)
> > {
> > - if (page->mem_cgroup)
> > - __mod_memcg_state(page->mem_cgroup, idx, val);
> > + if (page_mem_cgroup(page))
> > + __mod_memcg_state(page_mem_cgroup(page), idx, val);
> > }
> >
> > static inline void mod_memcg_page_state(struct page *page,
> > int idx, int val)
> > {
> > - if (page->mem_cgroup)
> > - mod_memcg_state(page->mem_cgroup, idx, val);
> > + if (page_mem_cgroup(page))
> > + mod_memcg_state(page_mem_cgroup(page), idx, val);
> > }
> >
> > static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
> > @@ -838,12 +904,12 @@ static inline void __mod_lruvec_page_state(struct page *page,
> > struct lruvec *lruvec;
> >
> > /* Untracked pages have no memcg, no lruvec. Update only the node */
> > - if (!head->mem_cgroup) {
> > + if (!page_mem_cgroup(head)) {
> > __mod_node_page_state(pgdat, idx, val);
> > return;
> > }
> >
> > - lruvec = mem_cgroup_lruvec(head->mem_cgroup, pgdat);
> > + lruvec = mem_cgroup_lruvec(page_mem_cgroup(head), pgdat);
> > __mod_lruvec_state(lruvec, idx, val);
>
> The repetition of the function call is a bit jarring, especially in
> configs with VM_BUG_ON() enabled (some distros use it for their beta
> release kernels, so it's not just kernel developer test machines that
> pay this cost). Can you please use a local variable when the function
> needs the memcg more than once?

Sure.
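
For illustration, a minimal user-space sketch of the local-variable pattern
applied to __mod_memcg_page_state(). The structs, the nr_lookups counter and
the simplified page_memcg() are stand-ins invented for this example, not the
kernel code; the counter only makes the number of helper calls visible.

```c
#include <assert.h>
#include <stddef.h>

/* User-space placeholders, not the real kernel definitions. */
struct mem_cgroup { long stat[4]; };
struct page { unsigned long memcg_data; };

/* Counts helper invocations, purely to make the cost observable. */
static int nr_lookups;

/* Simplified stand-in for the page -> memcg lookup helper. */
static struct mem_cgroup *page_memcg(struct page *page)
{
	nr_lookups++;
	return (struct mem_cgroup *)page->memcg_data;
}

static void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
{
	assert(idx >= 0 && idx < 4);
	memcg->stat[idx] += val;
}

static void __mod_memcg_page_state(struct page *page, int idx, int val)
{
	/* Look the memcg up once and reuse it for both the test and the call. */
	struct mem_cgroup *memcg = page_memcg(page);

	if (memcg)
		__mod_memcg_state(memcg, idx, val);
}
```

With the helper called once per update, any debug check inside it (such as
the VM_BUG_ON_PAGE() in the patch) is also paid only once.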

>
> > @@ -878,8 +944,8 @@ static inline void count_memcg_events(struct mem_cgroup *memcg,
> > static inline void count_memcg_page_event(struct page *page,
> > enum vm_event_item idx)
> > {
> > - if (page->mem_cgroup)
> > - count_memcg_events(page->mem_cgroup, idx, 1);
> > + if (page_mem_cgroup(page))
> > + count_memcg_events(page_mem_cgroup(page), idx, 1);
> > }
> >
> > static inline void count_memcg_event_mm(struct mm_struct *mm,
> > @@ -941,6 +1007,25 @@ void mem_cgroup_split_huge_fixup(struct page *head);
> >
> > struct mem_cgroup;
> >
> > +static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
> > +{
> > + return NULL;
> > +}
> > +
> > +static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
> > +{
> > + return NULL;
> > +}
> > +
> > +static inline void set_page_mem_cgroup(struct page *page,
> > + struct mem_cgroup *memcg)
> > +{
> > +}
> > +
> > +static inline void clear_page_mem_cgroup(struct page *page)
> > +{
> > +}
> > +
> > static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> > {
> > return true;
> > @@ -1430,7 +1515,7 @@ static inline void mem_cgroup_track_foreign_dirty(struct page *page,
> > if (mem_cgroup_disabled())
> > return;
> >
> > - if (unlikely(&page->mem_cgroup->css != wb->memcg_css))
> > + if (unlikely(&page_mem_cgroup(page)->css != wb->memcg_css))
> > mem_cgroup_track_foreign_dirty_slowpath(page, wb);
> > }
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 17e712207d74..5e24ff2ffec9 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1476,28 +1476,6 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> > #endif
> > }
> >
> > -#ifdef CONFIG_MEMCG
> > -static inline struct mem_cgroup *page_memcg(struct page *page)
> > -{
> > - return page->mem_cgroup;
> > -}
> > -static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> > -{
> > - WARN_ON_ONCE(!rcu_read_lock_held());
> > - return READ_ONCE(page->mem_cgroup);
> > -}
> > -#else
> > -static inline struct mem_cgroup *page_memcg(struct page *page)
> > -{
> > - return NULL;
> > -}
> > -static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> > -{
> > - WARN_ON_ONCE(!rcu_read_lock_held());
> > - return NULL;
> > -}
> > -#endif
>
> You essentially renamed these existing helpers, but I don't think
> that's justified. Especially with the proliferation of callsites, the
> original names are nicer. I'd prefer we keep them.
>
> > @@ -560,16 +560,7 @@ ino_t page_cgroup_ino(struct page *page)
> > unsigned long ino = 0;
> >
> > rcu_read_lock();
> > - memcg = page->mem_cgroup;
> > -
> > - /*
> > - * The lowest bit set means that memcg isn't a valid
> > - * memcg pointer, but a obj_cgroups pointer.
> > - * In this case the page is shared and doesn't belong
> > - * to any specific memory cgroup.
> > - */
> > - if ((unsigned long) memcg & 0x1UL)
> > - memcg = NULL;
> > + memcg = page_mem_cgroup_check(page);
>
> This should actually have been using READ_ONCE() all along. Otherwise
> the compiler can issue multiple loads to page->mem_cgroup here and you
> can end up with a pointer with the lowest bit set leaking out.
>
> > @@ -2928,17 +2918,6 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
> >
> > page = virt_to_head_page(p);
> >
> > - /*
> > - * If page->mem_cgroup is set, it's either a simple mem_cgroup pointer
> > - * or a pointer to obj_cgroup vector. In the latter case the lowest
> > - * bit of the pointer is set.
> > - * The page->mem_cgroup pointer can be asynchronously changed
> > - * from NULL to (obj_cgroup_vec | 0x1UL), but can't be changed
> > - * from a valid memcg pointer to objcg vector or back.
> > - */
> > - if (!page->mem_cgroup)
> > - return NULL;
> > -
> > /*
> > * Slab objects are accounted individually, not per-page.
> > * Memcg membership data for each individual object is saved in
> > @@ -2956,8 +2935,14 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
> > return NULL;
> > }
> >
> > - /* All other pages use page->mem_cgroup */
> > - return page->mem_cgroup;
> > + /*
> > + * page_mem_cgroup_check() is used here, because page_has_obj_cgroups()
> > + * check above could fail because the object cgroups vector wasn't set
> > + * at that moment, but it can be set concurrently.
> > + * page_mem_cgroup_check(page) will guarantee that a proper memory
> > + * cgroup pointer or NULL will be returned.
> > + */
> > + return page_mem_cgroup_check(page);
>
> The code right now doesn't look quite safe. As per above, without the
> READ_ONCE the compiler might issue multiple loads and we may get a
> pointer with the low bit set.
>
> Maybe slightly off-topic, but what are "all other pages" in general?
> I don't see any callsites that ask for ownership on objects whose
> backing pages may belong to a single memcg. That wouldn't seem to make
> too much sense. Unless I'm missing something, this function should
> probably tighten up its scope a bit and only work on stuff that is
> actually following the obj_cgroup protocol.

Kernel stacks can be backed by slab objects or by generic pages/vmallocs.
Also, large kmallocs use the page allocator directly, so they don't follow
the objcg protocol.

Thanks!