Re: [PATCH] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs

From: Alexander Duyck
Date: Thu Feb 14 2019 - 10:37:36 EST


On Thu, Feb 14, 2019 at 7:13 AM Jann Horn <jannh@xxxxxxxxxx> wrote:
>
> On Wed, Feb 13, 2019 at 11:42 PM Alexander Duyck
> <alexander.duyck@xxxxxxxxx> wrote:
> > On Wed, Feb 13, 2019 at 12:42 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> > > The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> > > number of references that we might need to create in the fastpath later,
> > > the bump-allocation fastpath only has to modify the non-atomic bias value
> > > that tracks the number of extra references we hold instead of the atomic
> > > refcount. The maximum number of allocations we can serve (under the
> > > assumption that no allocation is made with size 0) is nc->size, so that's
> > > the bias used.
> > >
> > > However, even when all memory in the allocation has been given away, a
> > > reference to the page is still held; and in the `offset < 0` slowpath, the
> > > page may be reused if everyone else has dropped their references.
> > > This means that the necessary number of references is actually
> > > `nc->size+1`.
> > >
> > > Luckily, from a quick grep, it looks like the only path that can call
> > > page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> > > requires CAP_NET_ADMIN in the init namespace and is only intended to be
> > > used for kernel testing and fuzzing.
> >
> > Actually that has me somewhat concerned. I wouldn't be surprised if
> > most drivers expect the netdev_alloc_frags call to at least output an
> > SKB_DATA_ALIGN sized value.
> >
> > We probably should update __netdev_alloc_frag and __napi_alloc_frag so
> > that they will pass fragsz through SKB_DATA_ALIGN.
>
> Do you want to do a separate patch for that? I'd like to not mix
> logically separate changes in a single patch, and I also don't have a
> good understanding of the alignment concerns here.

You could just include it as a separate patch with your work.
Otherwise I will get to it when I have time.

The point is the issue you pointed out will actually cause other
issues if the behavior is maintained since you shouldn't be getting
unaligned blocks out of the frags API anyway.

> > > To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> > > `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> > > writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> > > with a vector consisting of 15 elements containing 1 byte each.
> > >
> > > Cc: stable@xxxxxxxxxxxxxxx
> > > Signed-off-by: Jann Horn <jannh@xxxxxxxxxx>
> > > ---
> > > mm/page_alloc.c | 8 ++++----
> > > 1 file changed, 4 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 35fdde041f5c..46285d28e43b 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> > > /* Even if we own the page, we do not use atomic_set().
> > > * This would break get_page_unless_zero() users.
> > > */
> > > - page_ref_add(page, size - 1);
> > > + page_ref_add(page, size);
> > >
> > > /* reset page count bias and offset to start of new frag */
> > > nc->pfmemalloc = page_is_pfmemalloc(page);
> > > - nc->pagecnt_bias = size;
> > > + nc->pagecnt_bias = size + 1;
> > > nc->offset = size;
> > > }
> > >
> > > @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
> > > size = nc->size;
> > > #endif
> > > /* OK, page count is 0, we can safely set it */
> > > - set_page_count(page, size);
> > > + set_page_count(page, size + 1);
> > >
> > > /* reset page count bias and offset to start of new frag */
> > > - nc->pagecnt_bias = size;
> > > + nc->pagecnt_bias = size + 1;
> > > offset = size - fragsz;
> > > }
> >
> > If we already have to add a constant it might be better to just use
> > PAGE_FRAG_CACHE_MAX_SIZE + 1 in all these spots where you are having
> > to use "size + 1" instead of "size". That way we can avoid having to
> > add a constant to a register value and then program that value.
> > instead we can just assign the constant value right from the start.
>
> I doubt that these few instructions make a difference, but sure, I can
> send a v2 with that changed.

You would be surprised. They all end up adding up over time.