Re: [PATCH 4/4] mm: numa: Slow PTE scan rate if migration failures occur

From: Ingo Molnar
Date: Sun Mar 08 2015 - 05:54:26 EST



* Mel Gorman <mgorman@xxxxxxx> wrote:

> Elapsed time is primarily worse on one benchmark -- numa01 which is
> an adverse workload. The user time differences are also dominated by
> that benchmark
>
>                                      4.0.0-rc1             4.0.0-rc1                3.19.0
>                                        vanilla         slowscan-v2r7               vanilla
> Time User-NUMA01            32883.59 (  0.00%)    35288.00 ( -7.31%)    25695.96 ( 21.86%)
> Time User-NUMA01_THEADLOCAL 17453.20 (  0.00%)    17765.79 ( -1.79%)    17404.36 (  0.28%)
> Time User-NUMA02             2063.70 (  0.00%)     2063.22 (  0.02%)     2037.65 (  1.26%)
> Time User-NUMA02_SMT          983.70 (  0.00%)      976.01 (  0.78%)      981.02 (  0.27%)

But even for 'numa02', the simplest of the workloads, there appears to
be something of a regression relative to v3.19, which ought to be
beyond the noise of the measurement (which I suspect is below 1%), and
as such relevant, right?

And the XFS numbers still show a significant regression compared to
v3.19 - and that cannot be dismissed as an artificial, 'adversarial'
workload, right?

For example, from your numbers:

xfsrepair
                             4.0.0-rc1           4.0.0-rc1              3.19.0
                               vanilla         slowscan-v2             vanilla
...
Amean real-xfsrepair  507.85 (  0.00%)    459.58 (  9.50%)    447.66 ( 11.85%)
Amean syst-xfsrepair  519.88 (  0.00%)    281.63 ( 45.83%)    202.93 ( 60.97%)

If I interpret the numbers correctly, they show that even with the
slowscan patch applied, system time is still about 38% higher than on
v3.19 (281.63 vs. 202.93) - which is rather significant!
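
For anyone wanting to re-derive these percentages, here is a minimal
sketch. The figures are copied from the tables quoted above; the helper
names are made up for illustration, and the reading of the parenthesized
column (gain relative to the first, baseline column) is an assumption
that happens to reproduce the quoted values:

```python
# Illustrative check of the percentages discussed above.
# Helper names are hypothetical; they are not part of any benchmark tooling.

def gain_vs_baseline(baseline, value):
    """Assumed meaning of the parenthesized column: positive = less time than baseline."""
    return (baseline - value) / baseline * 100.0

def increase_over(reference, value):
    """Percentage by which `value` exceeds `reference`."""
    return (value - reference) / reference * 100.0

# Columns: 4.0.0-rc1 vanilla, 4.0.0-rc1 slowscan, 3.19.0 vanilla
syst = {"rc1": 519.88, "slowscan": 281.63, "v3.19": 202.93}
real = {"rc1": 507.85, "slowscan": 459.58, "v3.19": 447.66}
numa02_user = {"rc1": 2063.70, "slowscan": 2063.22, "v3.19": 2037.65}

# Regression of the patched kernel relative to v3.19
syst_increase = increase_over(syst["v3.19"], syst["slowscan"])
real_increase = increase_over(real["v3.19"], real["slowscan"])
numa02_increase = increase_over(numa02_user["v3.19"], numa02_user["slowscan"])

print(f"xfsrepair system time vs v3.19: +{syst_increase:.1f}%")
print(f"xfsrepair real time vs v3.19:   +{real_increase:.1f}%")
print(f"numa02 user time vs v3.19:      +{numa02_increase:.1f}%")
```

Note that the two percentages use different denominators: the table's
column is relative to 4.0.0-rc1 vanilla, while the regression figure is
relative to v3.19.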

> > So what worries me is that Dave bisected the regression to:
> >
> > 4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")
> >
> > And clearly your patch #4 just tunes balancing/migration intensity
> > - is that a workaround for the real problem/bug?
>
> The patch makes NUMA hinting faults use standard page table handling
> routines and protections to trap the faults. Fundamentally it's
> safer even though it appears to cause more traps to be handled. I've
> been assuming this is related to the different permissions PTEs get
> and when they are visible on all CPUs. This patch is addressing the
> symptom that more faults are being handled and that it needs to be
> less aggressive.

But the whole cleanup ought to have been close to an identity
transformation from the CPU's point of view - and your measurements
seem to confirm Dave's findings.

And your measurement was on bare metal, while Dave's is on a VM, and
both show a significant slowdown on the xfs tests even with your
slow-tuning patch applied, so it's unlikely to be a measurement fluke
or some weird platform property.

> I've gone through that patch and didn't spot anything else that it is
> doing wrong that is not already handled in this series. Did you spot
> anything obviously wrong in that patch that isn't addressed in this
> series?

I didn't spot anything wrong, but is that a basis to go forward and
work around the regression, in a way that doesn't even recover lost
performance?

> > And the patch Dave bisected to is a relatively simple patch. Why
> > not simply revert it to see whether that cures much of the
> > problem?
>
> Because it also means reverting all the PROT_NONE handling and going
> back to _PAGE_NUMA tricks which I expect would be NAKed by Linus.

Yeah, I realize that (and obviously I support the PROT_NONE direction
that Peter Zijlstra prototyped with the original sched/numa series),
but can we leave this much of a regression on the table?

I hate to be such a pain in the neck, but especially the 'down tuning'
of the scanning intensity will make an apples to apples comparison
harder!

I'd rather not do the slow-tuning part and leave sucky performance in
place for now and have an easy method plus the motivation to find and
fix the real cause of the regression, than to partially hide it this
way ...

Thanks,

Ingo