Re: [PATCH v1 1/4] mm: memcontrol: use helpers to access page's memcg data
From: Johannes Weiner
Date: Thu Sep 24 2020 - 15:46:45 EST
On Tue, Sep 22, 2020 at 01:36:57PM -0700, Roman Gushchin wrote:
> Currently there are many open-coded reads and writes of the
> page->mem_cgroup pointer, as well as a couple of read helpers,
> which are barely used.
>
> It creates an obstacle on a way to reuse some bits of the pointer
> for storing additional bits of information. In fact, we already do
> this for slab pages, where the last bit indicates that a pointer has
> an attached vector of objcg pointers instead of a regular memcg
> pointer.
>
> This commit introduces 4 new helper functions and converts all
> raw accesses to page->mem_cgroup to calls of these helpers:
> struct mem_cgroup *page_mem_cgroup(struct page *page);
> struct mem_cgroup *page_mem_cgroup_check(struct page *page);
> void set_page_mem_cgroup(struct page *page, struct mem_cgroup *memcg);
> void clear_page_mem_cgroup(struct page *page);

Sounds reasonable to me!

> page_mem_cgroup_check() is intended to be used in cases when the page
> can be a slab page and have a memcg pointer pointing at objcg vector.
> It does check the lowest bit, and if set, returns NULL.
> page_mem_cgroup() contains a VM_BUG_ON_PAGE() check for the page not
> being a slab page. So do set_page_mem_cgroup() and clear_page_mem_cgroup().
>
> To make sure nobody uses a direct access, struct page's
> mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
> Only new helpers and a couple of slab-accounting related functions
> access this field directly.
>
> page_memcg() and page_memcg_rcu() helpers defined in mm.h are removed.
> New page_mem_cgroup() is a direct analog of page_memcg(), while
> page_memcg_rcu() has a single call site in a small rcu-read-lock
> section, so it's just not worth it to have a separate helper. So
> it's replaced with page_mem_cgroup() too.

page_memcg_rcu() does READ_ONCE(). We need to keep that for lockless
accesses.
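
A minimal userspace sketch of the point, not kernel code: the READ_ONCE()
macro below is a volatile-cast stand-in for the kernel's, and the reduced
struct definitions are assumptions made only so the example compiles on
its own. The caller is assumed to hold rcu_read_lock().

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's READ_ONCE(): the volatile access
 * makes the compiler perform the load exactly once. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct mem_cgroup { int id; };              /* hypothetical, reduced */
struct page { unsigned long memcg_data; };  /* hypothetical, reduced */

/* Lockless reader: one load of memcg_data, then a cast. */
static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
{
	return (struct mem_cgroup *)READ_ONCE(page->memcg_data);
}
```
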
> @@ -343,6 +343,72 @@ struct mem_cgroup {
>
> extern struct mem_cgroup *root_mem_cgroup;
>
> +/*
> + * page_mem_cgroup - get the memory cgroup associated with a page
> + * @page: a pointer to the page struct
> + *
> + * Returns a pointer to the memory cgroup associated with the page,
> + * or NULL. This function assumes that the page is known to have a
> + * proper memory cgroup pointer. It's not safe to call this function
> + * against some type of pages, e.g. slab pages or ex-slab pages.
> + */
> +static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
> +{
> + VM_BUG_ON_PAGE(PageSlab(page), page);
> + return (struct mem_cgroup *)page->memcg_data;
> +}

This would also be a good place to mention what's required for the
function to be called safely, or in a way that produces a stable
result - i.e. the list of conditions in commit_charge().

> + * page_mem_cgroup_check - get the memory cgroup associated with a page
> + * @page: a pointer to the page struct
> + *
> + * Returns a pointer to the memory cgroup associated with the page,
> + * or NULL. This function unlike page_mem_cgroup() can take any page
> + * as an argument. It has to be used in cases when it's not known if a page
> + * has an associated memory cgroup pointer or an object cgroups vector.
> + */
> +static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
> +{
> + unsigned long memcg_data = page->memcg_data;
> +
> + /*
> + * The lowest bit set means that memcg isn't a valid
> + * memcg pointer, but a obj_cgroups pointer.
> + * In this case the page is shared and doesn't belong
> + * to any specific memory cgroup.
> + */
> + if (memcg_data & 0x1UL)
> + return NULL;
> +
> + return (struct mem_cgroup *)memcg_data;
> +}

Here as well.

> +
> +/*
> + * set_page_mem_cgroup - associate a page with a memory cgroup
> + * @page: a pointer to the page struct
> + * @memcg: a pointer to the memory cgroup
> + *
> + * Associates a page with a memory cgroup.
> + */
> +static inline void set_page_mem_cgroup(struct page *page,
> + struct mem_cgroup *memcg)
> +{
> + VM_BUG_ON_PAGE(PageSlab(page), page);
> + page->memcg_data = (unsigned long)memcg;
> +}
> +
> +/*
> + * clear_page_mem_cgroup - clear an association of a page with a memory cgroup
> + * @page: a pointer to the page struct
> + *
> + * Clears an association of a page with a memory cgroup.
> + */
> +static inline void clear_page_mem_cgroup(struct page *page)
> +{
> + VM_BUG_ON_PAGE(PageSlab(page), page);
> + page->memcg_data = 0;
> +}
> +
> static __always_inline bool memcg_stat_item_in_bytes(int idx)
> {
> if (idx == MEMCG_PERCPU_B)
> @@ -743,15 +809,15 @@ static inline void mod_memcg_state(struct mem_cgroup *memcg,
> static inline void __mod_memcg_page_state(struct page *page,
> int idx, int val)
> {
> - if (page->mem_cgroup)
> - __mod_memcg_state(page->mem_cgroup, idx, val);
> + if (page_mem_cgroup(page))
> + __mod_memcg_state(page_mem_cgroup(page), idx, val);
> }
>
> static inline void mod_memcg_page_state(struct page *page,
> int idx, int val)
> {
> - if (page->mem_cgroup)
> - mod_memcg_state(page->mem_cgroup, idx, val);
> + if (page_mem_cgroup(page))
> + mod_memcg_state(page_mem_cgroup(page), idx, val);
> }
>
> static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
> @@ -838,12 +904,12 @@ static inline void __mod_lruvec_page_state(struct page *page,
> struct lruvec *lruvec;
>
> /* Untracked pages have no memcg, no lruvec. Update only the node */
> - if (!head->mem_cgroup) {
> + if (!page_mem_cgroup(head)) {
> __mod_node_page_state(pgdat, idx, val);
> return;
> }
>
> - lruvec = mem_cgroup_lruvec(head->mem_cgroup, pgdat);
> + lruvec = mem_cgroup_lruvec(page_mem_cgroup(head), pgdat);
> __mod_lruvec_state(lruvec, idx, val);

The repetition of the function call is a bit jarring, especially in
configs with VM_BUG_ON() enabled (some distros use it for their beta
release kernels, so it's not just kernel developer test machines that
pay this cost). Can you please use a local variable when the function
needs the memcg more than once?
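
Roughly what is meant, as a userspace sketch rather than the final patch:
the reduced structs, the `slab` flag, and the assert() standing in for
VM_BUG_ON_PAGE() are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

struct mem_cgroup { long stat[4]; };                  /* hypothetical, reduced */
struct page { unsigned long memcg_data; int slab; };  /* hypothetical, reduced */

static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
{
	assert(!page->slab);	/* stand-in for VM_BUG_ON_PAGE(PageSlab(page), page) */
	return (struct mem_cgroup *)page->memcg_data;
}

static inline void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
{
	memcg->stat[idx] += val;
}

static inline void __mod_memcg_page_state(struct page *page, int idx, int val)
{
	/* Look the memcg up once, so the debug check also runs once. */
	struct mem_cgroup *memcg = page_mem_cgroup(page);

	if (memcg)
		__mod_memcg_state(memcg, idx, val);
}
```
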
> @@ -878,8 +944,8 @@ static inline void count_memcg_events(struct mem_cgroup *memcg,
> static inline void count_memcg_page_event(struct page *page,
> enum vm_event_item idx)
> {
> - if (page->mem_cgroup)
> - count_memcg_events(page->mem_cgroup, idx, 1);
> + if (page_mem_cgroup(page))
> + count_memcg_events(page_mem_cgroup(page), idx, 1);
> }
>
> static inline void count_memcg_event_mm(struct mm_struct *mm,
> @@ -941,6 +1007,25 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>
> struct mem_cgroup;
>
> +static inline struct mem_cgroup *page_mem_cgroup(struct page *page)
> +{
> + return NULL;
> +}
> +
> +static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
> +{
> + return NULL;
> +}
> +
> +static inline void set_page_mem_cgroup(struct page *page,
> + struct mem_cgroup *memcg)
> +{
> +}
> +
> +static inline void clear_page_mem_cgroup(struct page *page)
> +{
> +}
> +
> static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> {
> return true;
> @@ -1430,7 +1515,7 @@ static inline void mem_cgroup_track_foreign_dirty(struct page *page,
> if (mem_cgroup_disabled())
> return;
>
> - if (unlikely(&page->mem_cgroup->css != wb->memcg_css))
> + if (unlikely(&page_mem_cgroup(page)->css != wb->memcg_css))
> mem_cgroup_track_foreign_dirty_slowpath(page, wb);
> }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 17e712207d74..5e24ff2ffec9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1476,28 +1476,6 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
> #endif
> }
>
> -#ifdef CONFIG_MEMCG
> -static inline struct mem_cgroup *page_memcg(struct page *page)
> -{
> - return page->mem_cgroup;
> -}
> -static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> -{
> - WARN_ON_ONCE(!rcu_read_lock_held());
> - return READ_ONCE(page->mem_cgroup);
> -}
> -#else
> -static inline struct mem_cgroup *page_memcg(struct page *page)
> -{
> - return NULL;
> -}
> -static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
> -{
> - WARN_ON_ONCE(!rcu_read_lock_held());
> - return NULL;
> -}
> -#endif

You essentially renamed these existing helpers, but I don't think
that's justified. Especially with the proliferation of callsites, the
original names are nicer. I'd prefer we keep them.

> @@ -560,16 +560,7 @@ ino_t page_cgroup_ino(struct page *page)
> unsigned long ino = 0;
>
> rcu_read_lock();
> - memcg = page->mem_cgroup;
> -
> - /*
> - * The lowest bit set means that memcg isn't a valid
> - * memcg pointer, but a obj_cgroups pointer.
> - * In this case the page is shared and doesn't belong
> - * to any specific memory cgroup.
> - */
> - if ((unsigned long) memcg & 0x1UL)
> - memcg = NULL;
> + memcg = page_mem_cgroup_check(page);

This should actually have been using READ_ONCE() all along. Otherwise
the compiler can issue multiple loads to page->mem_cgroup here and you
can end up with a pointer with the lowest bit set leaking out.
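
A sketch of the single-load pattern, in userspace C: READ_ONCE() here is
a volatile-cast stand-in for the kernel macro and the structs are reduced,
both assumptions for illustration. The bit test and the returned pointer
come from the same snapshot, so a concurrent change from NULL to
(obj_cgroup_vec | 0x1UL) cannot slip in between them.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's READ_ONCE(). */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct mem_cgroup { int id; };              /* hypothetical, reduced */
struct page { unsigned long memcg_data; };  /* hypothetical, reduced */

static inline struct mem_cgroup *page_mem_cgroup_check(struct page *page)
{
	/* Exactly one load; everything below works on this local copy. */
	unsigned long memcg_data = READ_ONCE(page->memcg_data);

	/* Lowest bit set: an obj_cgroups vector, not a memcg pointer. */
	if (memcg_data & 0x1UL)
		return NULL;

	return (struct mem_cgroup *)memcg_data;
}
```
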
> @@ -2928,17 +2918,6 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
>
> page = virt_to_head_page(p);
>
> - /*
> - * If page->mem_cgroup is set, it's either a simple mem_cgroup pointer
> - * or a pointer to obj_cgroup vector. In the latter case the lowest
> - * bit of the pointer is set.
> - * The page->mem_cgroup pointer can be asynchronously changed
> - * from NULL to (obj_cgroup_vec | 0x1UL), but can't be changed
> - * from a valid memcg pointer to objcg vector or back.
> - */
> - if (!page->mem_cgroup)
> - return NULL;
> -
> /*
> * Slab objects are accounted individually, not per-page.
> * Memcg membership data for each individual object is saved in
> @@ -2956,8 +2935,14 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
> return NULL;
> }
>
> - /* All other pages use page->mem_cgroup */
> - return page->mem_cgroup;
> + /*
> + * page_mem_cgroup_check() is used here, because page_has_obj_cgroups()
> + * check above could fail because the object cgroups vector wasn't set
> + * at that moment, but it can be set concurrently.
> + * page_mem_cgroup_check(page) will guarantee that a proper memory
> + * cgroup pointer or NULL will be returned.
> + */
> + return page_mem_cgroup_check(page);

The code right now doesn't look quite safe. As per above, without the
READ_ONCE the compiler might issue multiple loads and we may get a
pointer with the low bit set.

Maybe slightly off-topic, but what are "all other pages" in general?
I don't see any callsites that ask for ownership on objects whose
backing pages may belong to a single memcg. That wouldn't seem to make
too much sense. Unless I'm missing something, this function should
probably tighten up its scope a bit and only work on stuff that is
actually following the obj_cgroup protocol.

I.e. either do the obj_cgroup lookup, or return root_mem_cgroup like
the other mem_cgroup_from_* functions.
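
One possible shape of that, as a heavily simplified userspace sketch and
not a proposal for the actual code: the function name, the explicit `idx`
argument (the kernel would derive the slot from the object address), the
reduced structs, and the choice to map unaccounted objects to
root_mem_cgroup are all assumptions for illustration.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's READ_ONCE(). */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct mem_cgroup { int id; };                    /* hypothetical, reduced */
struct obj_cgroup { struct mem_cgroup *memcg; };  /* hypothetical, reduced */
struct page { unsigned long memcg_data; };        /* hypothetical, reduced */

static struct mem_cgroup root_mem_cgroup_storage;
static struct mem_cgroup *root_mem_cgroup = &root_mem_cgroup_storage;

/* Hypothetical helper: resolve an object's memcg via the objcg vector only. */
static struct mem_cgroup *mem_cgroup_from_obj_sketch(struct page *page,
						     size_t idx)
{
	unsigned long memcg_data = READ_ONCE(page->memcg_data);
	struct obj_cgroup **vec;

	/* Not following the obj_cgroup protocol: fall back to root. */
	if (!(memcg_data & 0x1UL))
		return root_mem_cgroup;

	/* Strip the tag bit to recover the vector, then look up the slot. */
	vec = (struct obj_cgroup **)(memcg_data & ~0x1UL);
	if (!vec[idx])
		return root_mem_cgroup;

	return vec[idx]->memcg;
}
```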