Re: [PATCH] mm: be more verbose for alloc_contig_range failures
From: Minchan Kim
Date: Thu Mar 04 2021 - 13:13:00 EST
On Thu, Mar 04, 2021 at 06:23:09PM +0100, David Hildenbrand wrote:
> > > You want to debug something, so you try triggering it and capturing debug
> > > data. There are not that many alloc_contig_range() users such that this
> > > would really be an issue to isolate ...
> >
> > cma_alloc uses alloc_contig_range and cma_alloc has lots of users.
> > It is even exported by dmabuf, so any userspace could trigger the
> > allocation on its own. Some of them could tolerate the failure; for
> > the rest it could be critical. We shouldn't expect it to be limited
> > to kernel use cases.
>
> Assume you are debugging allocation failures. You either collect the data
> yourself or ask someone to send you that output. You care about any
> alloc_contig_range() allocation failures that shouldn't happen, don't you?
>
> >
> > >
> > > Strictly speaking: any allocation failure on ZONE_MOVABLE or CMA is
> > > problematic (putting NORETRY logic and similar aside). So any such
> > > page you hit is worth investigating and, therefore, worth getting logged for
> > > debugging purposes.
> >
> > If you believe that every alloc_contig_range failure is problematic
>
> Every one where we should have guarantees I guess: ZONE_MOVABLE or
> MIGRATE_CMA. On ZONE_NORMAL, there are no guarantees.
Indeed.
>
> > and there is no such real example as I mentioned above in the world,
> > I am happy to put this chunk to support dynamic debugging.
> > Okay?
> >
> > +#if defined(CONFIG_DYNAMIC_DEBUG) || \
> > +	(defined(CONFIG_DYNAMIC_DEBUG_CORE) && defined(DYNAMIC_DEBUG_MODULE))
> > +static DEFINE_RATELIMIT_STATE(alloc_contig_ratelimit_state,
> > +		DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> > +int alloc_contig_ratelimit(void)
> > +{
> > +	return __ratelimit(&alloc_contig_ratelimit_state);
> > +}
> > +
>
> ^ do we need ratelimiting with dynamic debugging enabled?
The main argument was debug message flooding. Even with dynamic
debugging, that issue does not disappear.
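To illustrate the flooding concern, here is a minimal userspace sketch of
an interval/burst limiter. It is not the kernel's __ratelimit()
implementation: struct rl_state, rl_allow() and rl_simulate() are
hypothetical names, and time is modeled as a plain tick counter.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for a ratelimit state (not the kernel struct). */
struct rl_state {
	long interval;	/* window length, in ticks */
	int burst;	/* maximum reports allowed per window */
	long begin;	/* tick at which the current window started */
	int printed;	/* reports emitted in the current window */
};

/* Return true if a report may be emitted at tick 'now'. */
static bool rl_allow(struct rl_state *rs, long now)
{
	/* Start a fresh window once the interval has elapsed. */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	return false;
}

/* Simulate 'n' failures, one per tick from tick 0; return how many get reported. */
static int rl_simulate(struct rl_state *rs, int n)
{
	int reported = 0;

	for (long now = 0; now < n; now++)
		if (rl_allow(rs, now))
			reported++;
	return reported;
}
```

With an interval of 5 ticks and a burst of 2, 20 back-to-back failures
produce 8 reports instead of 20, no matter who enabled the debug output.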
>
> > +void dump_migrate_failure_pages(struct list_head *page_list)
> > +{
> > +	DEFINE_DYNAMIC_DEBUG_METADATA(descriptor,
> > +			"migrate failure");
> > +	if (DYNAMIC_DEBUG_BRANCH(descriptor) &&
> > +			alloc_contig_ratelimit()) {
> > +		struct page *page;
> > +
> > +		WARN(1, "failed callstack");
> > +		list_for_each_entry(page, page_list, lru)
> > +			dump_page(page, "migration failure");
>
> Are all pages on the list guaranteed to be problematic, or only the first
> entry? I assume all.
All.
>
> > +	}
> > +}
> > +#else
> > +static inline void dump_migrate_failure_pages(struct list_head *page_list)
> > +{
> > +}
> > +#endif
> > +
> > /* [start, end) must belong to a single zone. */
> > static int __alloc_contig_migrate_range(struct compact_control *cc,
> > unsigned long start, unsigned long end)
> > @@ -8496,6 +8522,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
> > NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
> > }
> >  	if (ret < 0) {
> > +		dump_migrate_failure_pages(&cc->migratepages);
> >  		putback_movable_pages(&cc->migratepages);
> >  		return ret;
> >  	}
> >
> >
>
> If that's the way dynamic debugging is configured/enabled (still have to
> look into it) - yes, that goes in the right direction. As I said above,
> you should dump only where we have some kind of guarantees, I assume.
Sure, let me wait for your review before sending the next revision.
Thanks for the review!