Re: [PATCH 2/2] mm: make compound_head() robust

From: Hugh Dickins
Date: Sun Aug 16 2015 - 21:41:41 EST


On Thu, 13 Aug 2015, Kirill A. Shutemov wrote:

> Hugh has pointed that compound_head() call can be unsafe in some
> context. There's one example:
>
> CPU0 CPU1
>
> isolate_migratepages_block()
> page_count()
> compound_head()
> !!PageTail() == true
> put_page()
> tail->first_page = NULL
> head = tail->first_page
> alloc_pages(__GFP_COMP)
> prep_compound_page()
> tail->first_page = head
> __SetPageTail(p);
> !!PageTail() == true
> <head == NULL dereferencing>
>
> The race is pure theoretical. I don't it's possible to trigger it in
> practice. But who knows.
>
> We can fix the race by chaging how encode PageTail() and compound_head()
> within struct page to be able to update them in one shot.
>
> The patch introduces page->compound_head into union with
> page->mem_cgroup.
>
> Set bit 0 of page->compound_head means that the page is tail. If the bit
> set, rest of the page->compound_head is pointer to head page. Otherwise,
> the field is NULL or pointer to memory cgroup.
>
> page->mem_cgroup currenly only used for small or head pages, so there
> shouldn't be any conflicts.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: Vlastimil Babka <vbabka@xxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> ---
> Documentation/vm/split_page_table_lock | 4 +-
> arch/xtensa/configs/iss_defconfig | 1 -
> include/linux/mm.h | 53 ++--------------------
> include/linux/mm_types.h | 15 ++++---
> include/linux/page-flags.h | 80 ++++++++--------------------------
> mm/Kconfig | 12 -----
> mm/debug.c | 7 ---
> mm/hugetlb.c | 8 +---
> mm/internal.h | 4 +-
> mm/memory-failure.c | 7 ---
> mm/page_alloc.c | 36 +++++++--------
> mm/swap.c | 4 +-
> 12 files changed, 56 insertions(+), 175 deletions(-)

Mostly I like this: especially those deletions,
and removing the unloved CONFIG_PAGEFLAGS_EXTENDED.

But I do disagree with:

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0038ac7466fd..e0c4c0a8ec3d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -172,13 +172,17 @@ struct page {
> #endif
> #endif
> struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
> - struct page *first_page; /* Compound tail pages */
> };
>
> -#ifdef CONFIG_MEMCG
> - struct mem_cgroup *mem_cgroup;
> -#endif
> + union {
> + /* Bit zero of the word encode PageTail() */
> + struct mem_cgroup *mem_cgroup; /* If bit zero is clear */
> + unsigned long compound_head; /* If bit zero is set */
> + };

On 32-bit, I think the addition of that mem_cgroup pointer enlarged
struct page from 32 bytes to 36 (SLAB or SLOB) or 40 (SLUB) bytes.
I can well imagine people wanting to cut that bloat by turning off
CONFIG_MEMCG, but now you would be thwarting them. (I can also
imagine memcg people might want to add flag bits of their own to it.)

My own preference (Andrew might disagree) would be to give up on
compound_page_dtor *compound_dtor: in all the years it's been there,
it has only been assigned two possibilities, and I think you'd do
well to hard code those in the one or two places they're needed -
moving PageHuge to be a second bit alongside your PageTail (but of
course it could only be set in the first tail page, not the head).

But as far as fixing the isolate_migratepages_block() bug you
discovered, I think your original atomic_read(&page->_count) fix
was good enough, or a page_count_raw(page) if Andrew prefers -
though I'm not so keen on dressing these things up myself, such
raw scans are exceptional and face very special problems, I like
to see exactly what's going on in them.

And even when a race between PageTail and first_page is resolved
by your READ_ONCE, it still leaves races of whether the head page
still agrees with the tail. Safer races now because first_page is
reliably a recent first_page pointer: which might be all that's needed
to make your refcounting patches' wider use of compound_head() safe -
this patch does look like a good step for those.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/