Re: [PATCH] mm/vmstat: Defer the refresh_zone_stat_thresholds after all CPUs bringup

From: Andrew Morton
Date: Fri Jul 05 2024 - 16:59:26 EST

On Fri, 5 Jul 2024 01:48:21 -0700 Saurabh Sengar <ssengar@xxxxxxxxxxxxxxxxxxx> wrote:

> The refresh_zone_stat_thresholds function has two loops, which is expensive
> for a large number of CPUs and NUMA nodes.
> Below is a rough estimate of the total iterations done by these loops,
> based on the number of NUMA nodes and CPUs.
> Total number of iterations: nCPU * 2 * Numa * mCPU
> Where:
> nCPU = total number of CPUs
> Numa = total number of NUMA nodes
> mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs)
> For the system under test with 16 NUMA nodes and 1024 CPUs, this
> results in a substantial increase in the number of loop iterations
> during boot-up when NUMA is enabled:
> No NUMA = 1024*2*1*512 = 1,048,576 : Here refresh_zone_stat_thresholds
> takes around 224 ms total for all the CPUs in the system under test.
> 16 NUMA = 1024*2*16*512 = 16,777,216 : Here refresh_zone_stat_thresholds
> takes around 4.5 seconds total for all the CPUs in the system under test.

Did you measure the overall before-and-after times? IOW, how much of
that 4.5s do we reclaim?
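
For reference, the arithmetic in the quoted estimate can be reproduced
with a few lines of C. This is only a sketch of the formula as stated in
the changelog (nCPU * 2 * Numa * mCPU, taking mCPU as nCPU / 2); the
function name is made up for illustration and nothing here is kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimated loop iterations, per the formula in the changelog:
 *   nCPU * 2 * Numa * mCPU
 * mCPU is assumed to be nCPU / 2, matching the "512 for 1024" example.
 * est_refresh_iterations() is a hypothetical helper, not a kernel function.
 */
static uint64_t est_refresh_iterations(uint64_t ncpus, uint64_t nodes)
{
	assert(ncpus > 0 && nodes > 0);

	uint64_t mean_cpus = ncpus / 2;

	return ncpus * 2 * nodes * mean_cpus;
}
```

With 1024 CPUs this gives 1,048,576 for one node and 16,777,216 for 16
nodes, as quoted. Note that the reported time grows by roughly 20x
(224 ms to 4.5 s) for a 16x increase in estimated iterations.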

> Calling this for each CPU is expensive when there are a large number
> of CPUs along with multiple NUMA nodes. Fix this by deferring
> refresh_zone_stat_thresholds so that it is called once, after all the
> secondary CPUs are up. Also, register the DYN hooks to keep the
> existing hotplug functionality intact.

Seems risky - we'll now have online CPUs which have uninitialized data,
yes? What assurance do we have that this data won't be accessed?

Another approach might be to make the code a bit smarter - instead of
calculating thresholds for the whole world, we make incremental changes
to the existing thresholds on behalf of the new resource which just
became available?
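
A rough userspace sketch of one way to read that suggestion. It assumes
the per-zone threshold depends on the CPU count only through
fls(num_online_cpus()), i.e. something of the form
2 * fls(cpus) * (1 + fls(mem)) capped at 125. That is an assumption about
calculate_normal_threshold() and should be checked against mm/vmstat.c.
All names below are hypothetical, and this is a skip-if-unchanged check
rather than a true incremental update; it ignores the pressure
thresholds entirely.

```c
#include <assert.h>

/* Userspace stand-in for fls(): 1-based index of the highest set bit, 0 for x == 0. */
static int fls_u(unsigned int x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/*
 * Assumed shape of the per-zone threshold.
 * mem_units is the zone's managed memory in 128 MB units.
 */
static int zone_threshold(unsigned int online_cpus, unsigned int mem_units)
{
	int t = 2 * fls_u(online_cpus) * (1 + fls_u(mem_units));

	return t < 125 ? t : 125;
}

/*
 * Hypothetical check: would onlining one more CPU change any threshold?
 * Under the assumption above, only when fls() of the CPU count changes,
 * which happens each time the count reaches a power of two.
 */
static int cpu_online_needs_refresh(unsigned int cpus_before)
{
	assert(cpus_before > 0);

	return fls_u(cpus_before + 1) != fls_u(cpus_before);
}
```

If the assumption holds, going from 1 to 1024 online CPUs would need a
recalculation only ten times (at 2, 4, 8, ..., 1024 CPUs) instead of on
every CPU bringup. A newly onlined CPU would still need its own per-cpu
thresholds filled in, so this does not by itself answer the
uninitialized-data question above.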