Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node

From: Daniel Jordan
Date: Tue Nov 27 2018 - 15:24:31 EST


On Tue, Nov 27, 2018 at 12:12:28AM +0000, Elliott, Robert (Persistent Memory) wrote:
> I ran a short test with:
> * HPE ProLiant DL360 Gen9 system
> * Intel Xeon E5-2699 CPU with 18 physical cores (0-17) and
> 18 hyperthreaded cores (36-53)
> * DDR4 NVDIMM-Ns (which run at regular DRAM DIMM speeds)
> * fio workload generator
> * cores on one CPU socket talking to a pmem device on the same CPU
> * large (1 MiB) random writes (to minimize the threads getting CPU cache
> hits from each other)
>
> Results:
> * 31.7 GB/s four threads, four physical cores (0,1,2,3)
> * 22.2 GB/s four threads, two physical cores (0,1,36,37)
> * 21.4 GB/s two threads, two physical cores (0,1)
> * 12.1 GB/s two threads, one physical core (0,36)
> * 11.2 GB/s one thread, one physical core (0)
>
> So, I think it's important that the initialization threads run on
> separate physical cores.

Thanks for running this. And fair enough, in this test using both siblings of
a core gives only a 4-8% speedup over using one (22.2 vs. 21.4 GB/s and 12.1
vs. 11.2 GB/s), so it makes sense to count only physical cores in the
calculation.

As for how to actually do this, some arches have smp_num_siblings, but there
should be a generic interface to provide that.

It's also possible to calculate this from the existing
topology_sibling_cpumask, but the first option is better IMHO. Open to
suggestions.
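
To make the second option concrete, here is a rough userspace model of the
counting, not kernel code. It assumes each CPU's sibling set fits in a 64-bit
mask, and the helper name count_physical_cores() is made up for illustration.
A CPU is counted only when it is the lowest-numbered bit in its own sibling
mask, so each core is counted once. In the kernel the equivalent check would
presumably be built on topology_sibling_cpumask() instead.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: sibling_mask[cpu] has one bit set for every hardware
 * thread that shares a physical core with cpu, including cpu itself.
 * Only the first 64 CPUs are considered in this sketch.
 */
static unsigned int count_physical_cores(const uint64_t *sibling_mask,
					 size_t nr_cpus)
{
	unsigned int cores = 0;
	size_t cpu;

	for (cpu = 0; cpu < nr_cpus && cpu < 64; cpu++) {
		uint64_t self = UINT64_C(1) << cpu;
		uint64_t mask = sibling_mask[cpu];

		/* Skip CPUs that are missing from their own mask. */
		if (!(mask & self))
			continue;

		/* Count the core once, at its lowest-numbered sibling. */
		if ((mask & (self - 1)) == 0)
			cores++;
	}

	return cores;
}
```

With masks { 0x5, 0xA, 0x5, 0xA } (CPUs 0/2 and 1/3 paired), only CPUs 0 and 1
pass the lowest-sibling check, so the function reports two cores.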

> For the number of cores to use, one approach is:
> memory bandwidth (number of interleaved channels * speed)
> divided by
> CPU core max sustained write bandwidth
>
> For example, this 2133 MT/s system is roughly:
> 68 GB/s (4 * 17 GB/s nominal)
> divided by
> 11.2 GB/s (one core's performance)
> which is
> 6 cores
>
> ACPI HMAT will report that 68 GB/s number. I'm not sure of
> a good way to discover the 11.2 GB/s number.

Yes, this would be nice to do if we could know the per-core number. One caveat
is that a single number like this would be most useful for the CPU-memory pair
it was calculated for, so the kernel could at least calculate it for jobs
operating on local memory.
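
For reference, the arithmetic in the quoted proposal could be sketched as
below. This is only an illustration under assumptions: the function name
bw_limited_threads() and the MB/s units are made up, the division rounds down,
and the result is clamped to between one thread and the number of physical
cores available. Where the two bandwidth inputs would come from is exactly the
open question above.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of threads needed to saturate memory
 * bandwidth, given the node's bandwidth and one core's sustained write
 * bandwidth, both in MB/s.
 */
static unsigned int bw_limited_threads(uint64_t mem_bw_mbps,
				       uint64_t core_bw_mbps,
				       unsigned int nr_cores)
{
	uint64_t n;

	/* Without usable inputs, fall back to a single thread. */
	if (core_bw_mbps == 0 || nr_cores == 0)
		return 1;

	/* Integer division rounds down: 68000 / 11200 gives 6. */
	n = mem_bw_mbps / core_bw_mbps;

	if (n < 1)
		n = 1;
	if (n > nr_cores)
		n = nr_cores;

	return (unsigned int)n;
}
```

Plugging in the numbers from the test system, bw_limited_threads(68000, 11200,
18) comes out to 6, matching the estimate in the quoted mail.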

Some BogoMIPS-like calibration may work, but I'll wait for ACPI HMAT support in
the kernel.