Re: [RFC PATCH 0/6] Drain remote per-cpu directly v2

From: Minchan Kim
Date: Wed May 11 2022 - 13:20:34 EST


On Wed, May 11, 2022 at 01:47:00PM +0100, Mel Gorman wrote:
> On Tue, May 10, 2022 at 11:13:05AM -0700, Minchan Kim wrote:
> > > Yes, but as reclaim is not fundamentally altered the main difference
> > > in behavious is that work is done inline instead of being deferred to a
> > > workqueue. That means in some cases, system CPU usage of a task will be
> > > higher because it's paying the cost directly.
> >
> > Sure, but the reclaim path is already expensive, so I doubt we would
> > see a sizable difference in system CPU usage.
> >
>
> It would be difficult to distinguish from the noise.
>
> > What I wanted to see was whether we have a regression from adding
> > spin_lock/unlock instructions to the hot path. Since the lock is
> > squeezed into the same cacheline, I expected any regression to be
> > marginal.
> >
>
> Ah, yes, I did test for this. page-fault-test hits the relevant paths
> very heavily and did show minor differences.
>
>                                        5.18.0-rc1             5.18.0-rc1
>                                           vanilla       mm-pcpdrain-v2r1
> Hmean     faults/sec-1    886331.5718 (   0.00%)    885462.7479 (  -0.10%)
> Hmean     faults/sec-3   2337706.1583 (   0.00%)   2332130.4909 *  -0.24%*
> Hmean     faults/sec-5   2851594.2897 (   0.00%)   2844123.9307 (  -0.26%)
> Hmean     faults/sec-7   3543251.5507 (   0.00%)   3516889.0442 *  -0.74%*
> Hmean     faults/sec-8   3947098.0024 (   0.00%)   3916162.8476 *  -0.78%*
> Stddev    faults/sec-1      2302.9105 (   0.00%)      2065.0845 (  10.33%)
> Stddev    faults/sec-3      7275.2442 (   0.00%)      6033.2620 (  17.07%)
> Stddev    faults/sec-5     24726.0328 (   0.00%)     12525.1026 (  49.34%)
> Stddev    faults/sec-7      9974.2542 (   0.00%)      9543.9627 (   4.31%)
> Stddev    faults/sec-8      9468.0191 (   0.00%)      7958.2607 (  15.95%)
> CoeffVar  faults/sec-1         0.2598 (   0.00%)         0.2332 (  10.24%)
> CoeffVar  faults/sec-3         0.3112 (   0.00%)         0.2587 (  16.87%)
> CoeffVar  faults/sec-5         0.8670 (   0.00%)         0.4404 (  49.21%)
> CoeffVar  faults/sec-7         0.2815 (   0.00%)         0.2714 (   3.60%)
> CoeffVar  faults/sec-8         0.2399 (   0.00%)         0.2032 (  15.28%)
>
> There is a small hit in faults per second, but it's within the noise
> and the results are more stable with the series, so I'd mark it down
> as a small but potentially measurable impact.

Thanks for sharing. It would be great to have these numbers in the
description, too.
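
For archive readers: the hot-path cost being discussed is roughly one
(usually uncontended) lock/unlock pair around the pcp list operations.
A simplified sketch from my reading of the series, not the actual patch:

	/*
	 * The series puts a spinlock into struct per_cpu_pages, in the
	 * same cacheline as the fields it protects, so a remote CPU can
	 * drain the lists directly instead of scheduling work on the
	 * owning CPU.
	 */
	struct per_cpu_pages {
		spinlock_t lock;	/* protects the lists below */
		int count;		/* pages on the lists */
		/* ... */
		struct list_head lists[NR_PCP_LISTS];
	};

	/* allocation fast path: one extra lock round-trip */
	spin_lock(&pcp->lock);
	page = list_first_entry(list, struct page, pcp_list);
	list_del(&page->pcp_list);
	pcp->count -= 1 << order;
	spin_unlock(&pcp->lock);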

>
> > >
> > > The workloads I used just hit reclaim directly to make sure it's not
> > > functionally broken. There is no change in page aging decisions, only
> > > in the timing of drains. I didn't check a heavy workload interfering
> > > with a CPU-bound workload running on NOHZ CPUs as I assumed both you
> > > and Nicolas had a test case ready to use.
> >
> > My workload is not about NOHZ CPUs; it runs apps under heavy memory
> > pressure, so they go into direct reclaim and get stuck in
> > drain_all_pages until the queued work has run.
> >
> > unit: nanosecond
> >      max(dur)            avg(dur)    count(dur)
> >     166713013  487511.77786438033          1283
> >
> > From the traces, the system hit drain_all_pages 1283 times; the worst
> > case was 166ms and the average 487us.
> >
> > The other problem was alloc_contig_range in CMA. The PCP draining
> > sometimes takes several hundred milliseconds even when there is no
> > memory pressure and only a few pages need to be migrated out, because
> > the CPUs are fully booked.
> >
> > Your patches completely removed that wasted time.
> >
>
> Those stalls are painful and have a direct impact: a workload does not
> make progress. The NOHZ case is different in that the concern is
> interference. Both problems should have the same solution.
>
> Do you mind if I quote these paragraphs in the leader to v3?

Please have it in the description.
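
For context on where the stall comes from: before this series,
drain_all_pages() queued a drain work item on every CPU with pages on
its pcp lists and then waited for all of them. Roughly, simplified from
the pre-series code:

	for_each_cpu(cpu, &cpus_with_pcps) {
		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);

		drain->zone = zone;
		INIT_WORK(&drain->work, drain_local_pages_wq);
		queue_work_on(cpu, mm_percpu_wq, &drain->work);
	}
	/* the caller stalls here if any target CPU cannot run its work */
	for_each_cpu(cpu, &cpus_with_pcps)
		flush_work(&per_cpu(pcpu_drain, cpu).work);

If any of those CPUs is monopolized by a higher-priority task or a
backlog of pending work items, the caller sits in flush_work(); that is
the 166ms worst case in the traces above.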

>
> > > Which ones are of concern?
> > >
> > > Some of the page->lru references I left alone in the init paths simply
> > > because in those contexts, the page wasn't on a buddy or PCP list. In
> > > free_unref_page_list the page is not on the LRU, it's just been isolated
> > > from the LRU. In alloc_pages_bulk, it's not on a buddy, pcp or LRU list
> > > and is just a list placeholder so I left it alone. In
> > > free_tail_pages_check the context was a page that was likely previously
> > > on an LRU.
> >
> > Just nits: these are all list macros.
> >
> > free_pcppages_bulk's list_last_entry should be pcp_list.
> >
> > mark_free_pages's list_for_each_entry should be buddy_list.
> >
> > __rmqueue_pcplist's list_first_entry should be pcp_list.
> >
>
> Ah, you're completely correct.
>
> > >
> > > > since I have
> > > > tested this patchset with my workload and didn't spot any other
> > > > problems.
> > > >
> > >
> > > Can you describe this workload? Is it available anywhere, and does
> > > it require Android to execute?
> >
> > I described it above. It runs on Android, but I don't think it's an
> > Android-specific issue; anyone could see such long latencies from PCP
> > draining once one of the cores is monopolized by higher-priority
> > processes or too many pending kworks.
> >
>
> Yeah, I agree it's not an Android-specific problem. It could be detected by
> tracing the time spent in drain_all_pages for any arbitrary workload. The
> BCC funclatency tool could measure it.
>
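(For the archive: something like "funclatency -u drain_all_pages" from
the BCC tools should produce exactly that latency histogram, in
microseconds.)
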
> > >
> > > If you have positive results, it would be appreciated if you could post
> > > them or just note in a Tested-by/Acked-by that it had a measurable impact
> > > on the reclaim/cma path.
> >
> > Sure.
> >
> > All patches in this series.
> >
> > Tested-by: Minchan Kim <minchan@xxxxxxxxxx>
> > Acked-by: Minchan Kim <minchan@xxxxxxxxxx>
> >
>
> Thanks, I've added that to all the patches. I'll wait another day for
> more feedback before sending out a v3. The following is the diff between
> v2 and v3 based on your feedback.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4ac39d30ec8f..0f5a6a5b0302 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1497,7 +1497,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  		do {
>  			int mt;
>  
> -			page = list_last_entry(list, struct page, lru);
> +			page = list_last_entry(list, struct page, pcp_list);
>  			mt = get_pcppage_migratetype(page);
>  
>  			/* must delete to avoid corrupting pcp list */
> @@ -3276,7 +3276,7 @@ void mark_free_pages(struct zone *zone)
>  
>  	for_each_migratetype_order(order, t) {
>  		list_for_each_entry(page,
> -				&zone->free_area[order].free_list[t], lru) {
> +				&zone->free_area[order].free_list[t], buddy_list) {
>  			unsigned long i;
>  
>  			pfn = page_to_pfn(page);
> @@ -3761,7 +3761,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  			}
>  		}
>  
> -		page = list_first_entry(list, struct page, lru);
> +		page = list_first_entry(list, struct page, pcp_list);
>  		list_del(&page->pcp_list);
>  		pcp->count -= 1 << order;
>  	} while (check_new_pcp(page, order));
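
For what it's worth, the renames should be purely cosmetic: as I
understand the series, lru, buddy_list and pcp_list alias the same
storage in the struct page union, roughly:

	struct page {
		/* ... */
		union {
			struct list_head lru;		/* pageout list */
			struct list_head buddy_list;	/* free page on a buddy free_list */
			struct list_head pcp_list;	/* free page on a pcp list */
		};
		/* ... */
	};

so each hunk only changes which alias is named, documenting which list
the page actually sits on at that point, without changing the generated
code.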

Looks good to me. Please stick my Tested-by/Acked-by on the patches.

Thanks, Mel.