Re: [PATCH v2] mm: page_alloc: dump migrate-failed pages

From: Minchan Kim
Date: Mon Mar 08 2021 - 21:21:54 EST


On Mon, Mar 08, 2021 at 04:21:28PM -0800, Andrew Morton wrote:
> On Mon, 8 Mar 2021 12:20:47 -0800 Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> > alloc_contig_range is usually used on cma area or movable zone.
> > It's critical if the page migration fails on those areas so
> > dump more debugging message.
> >
> > page refcount, mapcount with page flags on dump_page are
> > helpful information to deduce the culprit. Furthermore,
> > dump_page_owner was super helpful to find long term pinner
> > who initiated the page allocation.
> >
> > Admin could enable the dump like this(by default, disabled)
> >
> > echo "func dump_migrate_failure_pages +p" > control
> >
> > Admin could disable it.
> >
> > echo "func dump_migrate_failure_pages =_" > control
> >
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -8453,6 +8453,34 @@ static unsigned long pfn_max_align_up(unsigned long pfn)
> > pageblock_nr_pages));
> > }
> >
> > +#if defined(CONFIG_DYNAMIC_DEBUG) || \
> > + (defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
> > +static DEFINE_RATELIMIT_STATE(alloc_contig_ratelimit_state,
> > + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> > +int alloc_contig_ratelimit(void)
> > +{
> > + return __ratelimit(&alloc_contig_ratelimit_state);
> > +}
>
> Wow, that's an eyesore. We're missing helpers in the ratelimit code.
> Can we do something like
>
> /* description goes here */
> #define RATELIMIT2(interval, burst)
> ({
> static DEFINE_RATELIMIT_STATE(_rs, interval, burst);
>
> __ratelimit(_rs);
> })
>
> #define RATELIMIT()
> RATELIMIT2(DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST)
>
> > +void dump_migrate_failure_pages(struct list_head *page_list)
> > +{
> > + DEFINE_DYNAMIC_DEBUG_METADATA(descriptor,
> > + "migrate failure");
> > + if (DYNAMIC_DEBUG_BRANCH(descriptor) &&
> > + alloc_contig_ratelimit()) {
> > + struct page *page;
> > +
> > + WARN(1, "failed callstack");
> > + list_for_each_entry(page, page_list, lru)
> > + dump_page(page, "migration failure");
> > + }
> > +}
>
> Then we can simply do
>
> if (DYNAMIC_DEBUG_BRANCH(descriptor) && RATELIMIT())

Sounds good idea to me. There are many places to take the
benefit. However, let me leave it until we could discuss
this patch. We could clean them up as follow patch.

Thank you.