Re: [PATCH v2 for v5.9] mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

From: Joonsoo Kim
Date: Mon Sep 28 2020 - 21:28:19 EST


On Tue, Sep 29, 2020 at 8:52 AM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Mon, 28 Sep 2020 17:50:46 +0900 js1304@xxxxxxxxx wrote:
>
> > From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> >
> > The memalloc_nocma_{save/restore} APIs can be used to skip page
> > allocation from the CMA area, but there is a missing case, and a page
> > on the CMA area can be allocated even when the APIs are used. This
> > patch handles that case to fix the potential issue.
> >
> > The missing case is an allocation from the pcplist. The MIGRATE_MOVABLE
> > pcplist could have pages on the CMA area, so we need to skip it when
> > ALLOC_CMA is not specified.
>
> Changelog doesn't describe the end-user visible effects of the bug.
> Please send this description?

How about this one?

The memalloc_nocma_{save/restore} APIs can be used to skip page allocation
from the CMA area, but there is a missing case, and a page on the CMA area
can be allocated even when the APIs are used. This patch handles that case
to fix the potential issue.

For now, these APIs are used to prevent long-term pinning of CMA pages.
When long-term pinning is requested for a page on the CMA area, it is
migrated to a non-CMA page before pinning. This non-CMA page is allocated
under memalloc_nocma_{save/restore}. If the APIs don't work as intended,
a CMA page is allocated and pinned for a long time. Such a long-term pin
on a CMA page causes cma_alloc() failures, which could result in incorrect
behaviour in the device drivers that use cma_alloc().

The missing case is an allocation from the pcplist. The MIGRATE_MOVABLE
pcplist could have pages on the CMA area, so we need to skip it when
ALLOC_CMA is not specified.
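
For reference, the usage pattern of these APIs looks roughly like the
following (a simplified sketch; the actual caller, __gup_longterm_locked()
in mm/gup.c, differs in detail):

    unsigned int flags;
    struct page *dst_page;

    flags = memalloc_nocma_save();
    /* allocations in this window must not return a page on the CMA area */
    dst_page = alloc_page(GFP_HIGHUSER_MOVABLE);
    memalloc_nocma_restore(flags);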

> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3367,9 +3367,16 @@ struct page *rmqueue(struct zone *preferred_zone,
> > struct page *page;
> >
> > if (likely(order == 0)) {
> > - page = rmqueue_pcplist(preferred_zone, zone, gfp_flags,
> > + /*
> > + * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
> > + * we need to skip it when CMA area isn't allowed.
> > + */
> > + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> > + migratetype != MIGRATE_MOVABLE) {
> > + page = rmqueue_pcplist(preferred_zone, zone, gfp_flags,
> > migratetype, alloc_flags);
> > - goto out;
> > + goto out;
> > + }
> > }
> >
> > /*
>
> We still really really don't want to be adding overhead to the page
> allocation hotpath for a really really obscure feature from a single
> callsite.

In fact, if we clear the __GFP_MOVABLE flag when initializing the
allocation context, we can avoid CMA page allocation without adding
any overhead to the page allocation hotpath. But I don't think that is
good practice, since it cheats on the allocation type. It would have
side effects; for example, we could no longer use memory in ZONE_MOVABLE
in that case.
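
To illustrate, that alternative would look roughly like this (a
hypothetical sketch, not what this patch does):

    /*
     * Hypothetical: strip __GFP_MOVABLE under PF_MEMALLOC_NOCMA.
     * CMA pageblocks only serve movable allocations, so this keeps
     * the allocation off the CMA area, but it also wrongly excludes
     * ZONE_MOVABLE.
     */
    if (unlikely(current->flags & PF_MEMALLOC_NOCMA))
        gfp_mask &= ~__GFP_MOVABLE;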

> Do we have an understanding of how many people's kernels are enabling
> CONFIG_CMA?

AFAIK, almost all embedded systems enable CONFIG_CMA, for example,
handsets, TVs, and AVN systems in cars. Recently, Roman made CMA usable
for huge page allocation, so the number of users keeps growing.

> I previously suggested retrying the allocation in
> __gup_longterm_locked() but you said "it cannot ensure that we
> eventually get the non-CMA page". Please explain why?

To avoid allocating a CMA page, the pcplist should be empty at the moment
of the allocation. Retrying doesn't guarantee that we hit the case where
the pcplist is empty; it only increases the probability of that case, and
I think relying on probability is not good practice.
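
For example, a retry loop like the following (a hypothetical sketch, not
part of the patch) can keep drawing CMA pages from the pcplist; note that
put_page() on an order-0 page puts it straight back onto the same pcplist,
so the next attempt may even return the same page:

    /* hypothetical retry in __gup_longterm_locked() */
    do {
        page = alloc_page(GFP_HIGHUSER_MOVABLE);
        if (!page || !is_migrate_cma_page(page))
            break;      /* failed, or got a non-CMA page */
        put_page(page); /* got a CMA page: free it and retry */
    } while (--retries);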

> What about manually emptying the pcplists beforehand?

It also only increases the probability. A schedule() or an interrupt
after the emptying but before the allocation could refill the pcplist
and invalidate the effect.
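
Roughly (a sketch, assuming drain_all_pages() is used for the emptying):

    drain_all_pages(zone);  /* flush the pcplists on all CPUs */
    /*
     * A schedule() or an interrupt here can free CMA pages and
     * refill the local pcplist before the allocation below runs.
     */
    page = alloc_page(GFP_HIGHUSER_MOVABLE);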

> Or bypassing the pcplists for this caller and calling __rmqueue() directly?

That is what this patch does.

> > @@ -3381,7 +3388,7 @@ struct page *rmqueue(struct zone *preferred_zone,
> >
> > do {
> > page = NULL;
> > - if (alloc_flags & ALLOC_HARDER) {
> > + if (order > 0 && alloc_flags & ALLOC_HARDER) {
> > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> > if (page)
> > trace_mm_page_alloc_zone_locked(page, order, migratetype);
>
> What does this hunk do?

MIGRATE_HIGHATOMIC is an area reserved for high-order atomic allocations.
This hunk makes order-0 allocations skip that area. Before this patch,
order-0 allocations were always served from the pcplist and never reached
this code, so without the check an order-0 allocation that now falls
through to here could deplete the high-order atomic reserve.

Thanks.