Re: [PATCH v4 01/12] mm/vmstat: remove remote node draining

From: Marcelo Tosatti
Date: Mon Mar 13 2023 - 12:42:18 EST


On Thu, Mar 09, 2023 at 06:11:30PM +0100, Vlastimil Babka wrote:
> On 3/5/23 14:36, Marcelo Tosatti wrote:
> > Draining of pages from the local pcp for a remote zone should not be
> > necessary, since once the system is low on memory (or compaction on a
> > zone is in effect), drain_all_pages should be called freeing any unused
> > pcps.
>
> Hmm I can't easily say this is a good idea without a proper evaluation. It's
> true that drain_all_pages() will be called from
> __alloc_pages_direct_reclaim(), but if you look closely, it's only after
> __perform_reclaim() and subsequent get_page_from_freelist() failure, which
> means very low on memory. There's also kcompactd_do_work() caller, but that
> shouldn't respond to order-0 pressure.
>
> Maybe part of the problem is that pages sitting in pcplist are not counted
> as NR_FREE_PAGES, so watermark checks will not "see" them as available. That
> would be true for both local and remote nodes as we process the zonelist,
> but if pages are stuck on remote pcplists, it could make the problem worse
> than if they stayed only in local ones.
>
> But the good news should be that for this series you shouldn't actually need
> this patch, AFAIU. Last year Mel did the "Drain remote per-cpu directly"
> series and followups, so remote draining of pcplists is today done without
> sending an IPI to the remote CPU.

Vlastimil,

Sent -v5 which should address the concerns regarding pages stuck on
remote pcplists. Let me know if you have other concerns.

> BTW I kinda like the implementation that relies on "per-cpu spin lock"
> that's used mostly locally thus uncontended, and only rarely remotely. Much
> simpler than complicated cmpxchg schemes, wonder if those are really still
> such a win these days, e.g. in SLUB or here in vmstat...

Good question (to which i don't know the answer).

Thanks.

> > For reference, the original commit which introduces remote node
> > draining is 4037d452202e34214e8a939fa5621b2b3bbb45b7.
> >
> > Acked-by: David Hildenbrand <david@xxxxxxxxxx>
> > Signed-off-by: Marcelo Tosatti <mtosatti@xxxxxxxxxx>
> >
> > Index: linux-vmstat-remote/include/linux/mmzone.h
> > ===================================================================
> > --- linux-vmstat-remote.orig/include/linux/mmzone.h
> > +++ linux-vmstat-remote/include/linux/mmzone.h
> > @@ -679,9 +679,6 @@ struct per_cpu_pages {
> > int high; /* high watermark, emptying needed */
> > int batch; /* chunk size for buddy add/remove */
> > short free_factor; /* batch scaling factor during free */
> > -#ifdef CONFIG_NUMA
> > - short expire; /* When 0, remote pagesets are drained */
> > -#endif
> >
> > /* Lists of pages, one per migrate type stored on the pcp-lists */
> > struct list_head lists[NR_PCP_LISTS];
> > Index: linux-vmstat-remote/mm/vmstat.c
> > ===================================================================
> > --- linux-vmstat-remote.orig/mm/vmstat.c
> > +++ linux-vmstat-remote/mm/vmstat.c
> > @@ -803,20 +803,16 @@ static int fold_diff(int *zone_diff, int
> > *
> > * The function returns the number of global counters updated.
> > */
> > -static int refresh_cpu_vm_stats(bool do_pagesets)
> > +static int refresh_cpu_vm_stats(void)
> > {
> > struct pglist_data *pgdat;
> > struct zone *zone;
> > int i;
> > int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
> > int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
> > - int changes = 0;
> >
> > for_each_populated_zone(zone) {
> > struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
> > -#ifdef CONFIG_NUMA
> > - struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
> > -#endif
> >
> > for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
> > int v;
> > @@ -826,44 +822,8 @@ static int refresh_cpu_vm_stats(bool do_
> >
> > atomic_long_add(v, &zone->vm_stat[i]);
> > global_zone_diff[i] += v;
> > -#ifdef CONFIG_NUMA
> > - /* 3 seconds idle till flush */
> > - __this_cpu_write(pcp->expire, 3);
> > -#endif
> > }
> > }
> > -#ifdef CONFIG_NUMA
> > -
> > - if (do_pagesets) {
> > - cond_resched();
> > - /*
> > - * Deal with draining the remote pageset of this
> > - * processor
> > - *
> > - * Check if there are pages remaining in this pageset
> > - * if not then there is nothing to expire.
> > - */
> > - if (!__this_cpu_read(pcp->expire) ||
> > - !__this_cpu_read(pcp->count))
> > - continue;
> > -
> > - /*
> > - * We never drain zones local to this processor.
> > - */
> > - if (zone_to_nid(zone) == numa_node_id()) {
> > - __this_cpu_write(pcp->expire, 0);
> > - continue;
> > - }
> > -
> > - if (__this_cpu_dec_return(pcp->expire))
> > - continue;
> > -
> > - if (__this_cpu_read(pcp->count)) {
> > - drain_zone_pages(zone, this_cpu_ptr(pcp));
> > - changes++;
> > - }
> > - }
> > -#endif
> > }
> >
> > for_each_online_pgdat(pgdat) {
> > @@ -880,8 +840,7 @@ static int refresh_cpu_vm_stats(bool do_
> > }
> > }
> >
> > - changes += fold_diff(global_zone_diff, global_node_diff);
> > - return changes;
> > + return fold_diff(global_zone_diff, global_node_diff);
> > }
> >
> > /*
> > @@ -1867,7 +1826,7 @@ int sysctl_stat_interval __read_mostly =
> > #ifdef CONFIG_PROC_FS
> > static void refresh_vm_stats(struct work_struct *work)
> > {
> > - refresh_cpu_vm_stats(true);
> > + refresh_cpu_vm_stats();
> > }
> >
> > int vmstat_refresh(struct ctl_table *table, int write,
> > @@ -1877,6 +1836,8 @@ int vmstat_refresh(struct ctl_table *tab
> > int err;
> > int i;
> >
> > + drain_all_pages(NULL);
> > +
> > /*
> > * The regular update, every sysctl_stat_interval, may come later
> > * than expected: leaving a significant amount in per_cpu buckets.
> > @@ -1931,7 +1892,7 @@ int vmstat_refresh(struct ctl_table *tab
> >
> > static void vmstat_update(struct work_struct *w)
> > {
> > - if (refresh_cpu_vm_stats(true)) {
> > + if (refresh_cpu_vm_stats()) {
> > /*
> > * Counters were updated so we expect more updates
> > * to occur in the future. Keep on running the
> > @@ -1994,7 +1955,7 @@ void quiet_vmstat(void)
> > * it would be too expensive from this path.
> > * vmstat_shepherd will take care about that for us.
> > */
> > - refresh_cpu_vm_stats(false);
> > + refresh_cpu_vm_stats();
> > }
> >
> > /*
> > Index: linux-vmstat-remote/mm/page_alloc.c
> > ===================================================================
> > --- linux-vmstat-remote.orig/mm/page_alloc.c
> > +++ linux-vmstat-remote/mm/page_alloc.c
> > @@ -3176,26 +3176,6 @@ static int rmqueue_bulk(struct zone *zon
> > return allocated;
> > }
> >
> > -#ifdef CONFIG_NUMA
> > -/*
> > - * Called from the vmstat counter updater to drain pagesets of this
> > - * currently executing processor on remote nodes after they have
> > - * expired.
> > - */
> > -void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
> > -{
> > - int to_drain, batch;
> > -
> > - batch = READ_ONCE(pcp->batch);
> > - to_drain = min(pcp->count, batch);
> > - if (to_drain > 0) {
> > - spin_lock(&pcp->lock);
> > - free_pcppages_bulk(zone, to_drain, pcp, 0);
> > - spin_unlock(&pcp->lock);
> > - }
> > -}
> > -#endif
> > -
> > /*
> > * Drain pcplists of the indicated processor and zone.
> > */
> >
> >
> >
>
>
>