Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems
From: Matthew Wilcox
Date: Sun Apr 11 2021 - 06:33:33 EST
On Sun, Apr 11, 2021 at 11:43:07AM +0200, Jesper Dangaard Brouer wrote:
> On Sat, 10 Apr 2021 21:52:45 +0100
> "Matthew Wilcox (Oracle)" <willy@xxxxxxxxxxxxx> wrote:
>
> > 32-bit architectures which expect 8-byte alignment for 8-byte integers
> > and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
> > page inadvertently expanded in 2019. When the dma_addr_t was added,
> > it forced the alignment of the union to 8 bytes, which inserted a 4 byte
> > gap between 'flags' and the union.
> >
> > We could fix this by telling the compiler to use a smaller alignment
> > for the dma_addr, but that seems a little fragile. Instead, move the
> > 'flags' into the union. That causes dma_addr to shift into the same
> > bits as 'mapping', so it would have to be cleared on free. To avoid
> > this, insert three words of padding and use the same bits as ->index
> > and ->private, neither of which have to be cleared on free.
> >
> > Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> > ---
> > include/linux/mm_types.h | 38 ++++++++++++++++++++++++++------------
> > 1 file changed, 26 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 6613b26a8894..45c563e9b50e 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -68,16 +68,22 @@ struct mem_cgroup;
> > #endif
> >
> > struct page {
> > - unsigned long flags; /* Atomic flags, some possibly
> > - * updated asynchronously */
> > /*
> > - * Five words (20/40 bytes) are available in this union.
> > - * WARNING: bit 0 of the first word is used for PageTail(). That
> > - * means the other users of this union MUST NOT use the bit to
> > + * This union is six words (24 / 48 bytes) in size.
> > + * The first word is reserved for atomic flags, often updated
> > + * asynchronously. Use the PageFoo() macros to access it. Some
> > + * of the flags can be reused for your own purposes, but the
> > + * word as a whole often contains other information and overwriting
> > + * it will cause functions like page_zone() and page_node() to stop
> > + * working correctly.
> > + *
> > + * Bit 0 of the second word is used for PageTail(). That
> > + * means the other users of this union MUST leave the bit zero to
> > * avoid collision and false-positive PageTail().
> > */
> > union {
> > struct { /* Page cache and anonymous pages */
> > + unsigned long flags;
> > /**
> > * @lru: Pageout list, eg. active_list protected by
> > * lruvec->lru_lock. Sometimes used as a generic list
> > @@ -96,13 +102,14 @@ struct page {
> > unsigned long private;
> > };
> > struct { /* page_pool used by netstack */
> > - /**
> > - * @dma_addr: might require a 64-bit value even on
> > - * 32-bit architectures.
> > - */
> > - dma_addr_t dma_addr;
>
> The original intend of placing member @dma_addr here is that it overlap
> with @LRU (type struct list_head) which contains two pointers. Thus, in
> case of CONFIG_ARCH_DMA_ADDR_T_64BIT=y on 32-bit architectures it would
> use both pointers.
>
> Thinking more about this, this design is flawed as bit 0 of the first
> word is used for compound pages (see PageTail and @compound_head), is
> reserved. We knew DMA addresses were aligned, thus we though this
> satisfied that need. BUT for DMA_ADDR_T_64BIT=y on 32-bit arch the
> first word will contain the "upper" part of the DMA addr, which I don't
> think gives this guarantee.
>
> I guess, nobody are using this combination?!? I though we added this
> to satisfy TI (Texas Instrument) driver cpsw (code in
> drivers/net/ethernet/ti/cpsw*). Thus, I assumed it was in use?
It may be in use, but we've got away with it? It's relatively rare
that this is going to bite us. I think what has to happen is:
page is mapped to userspace
task calls get_user_page_fast(), loads the PTE
<preempted>
page is unmapped & freed
page is reallocated to the page_pool
page is DMA mapped to an address that happens to have that bit set
<first task resumes>
task looks for the compound_head() of that PTE, and attempts to bump
the refcount. *oops*
If it has happened, would it have turned into a bug report?
If we had seen such a bug report, would we have noticed it?
> > + unsigned long _pp_flags;
> > + unsigned long pp_magic;
> > + unsigned long xmi;
>
> Matteo notice, I think intent is we can store xdp_mem_info in @xmi.
Yep.
> > + unsigned long _pp_mapping_pad;
> > + dma_addr_t dma_addr; /* might be one or two words */
> > };
>
> Could you explain your intent here?
> I worry about @index.
>
> As I mentioned in other thread[1] netstack use page_is_pfmemalloc()
> (code copy-pasted below signature) which imply that the member @index
> have to be kept intact. In above, I'm unsure @index is untouched.
Argh, I read that piece of your message, and then promptly forgot about
it. I really don't like page_is_pfmemalloc() using the entirety of
page->index for this. How about we just do what slab does anyway
and use PageActive for page_is_pfmemalloc()?
Basically, we have three aligned dwords here. We can either alias with
@flags and the first word of @lru, or the second word of @lru and @mapping,
or @index and @private. @flags is a non-starter. If we use @mapping,
then you have to set it to NULL before you free it, and I'm not sure
how easy that will be for you. If that's trivial, then we could use
the layout:
unsigned long _pp_flags;
unsigned long pp_magic;
union {
dma_addr_t dma_addr; /* might be one or two words */
unsigned long _pp_align[2];
};
unsigned long pp_pfmemalloc;
unsigned long xmi;