Re: [mm/page_alloc] 7fef431be9: vm-scalability.throughput 87.8% improvement

From: David Hildenbrand
Date: Mon Oct 26 2020 - 15:14:26 EST



> On 26.10.2020 at 19:11, Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:
>
> On Mon, Oct 26, 2020 at 1:31 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>
>>> On 23.10.20 21:44, Axel Rasmussen wrote:
>>> On Fri, Oct 23, 2020 at 12:29 PM David Rientjes <rientjes@xxxxxxxxxx> wrote:
>>>>
>>>> On Wed, 21 Oct 2020, kernel test robot wrote:
>>>>
>>>>> Greeting,
>>>>>
>>>>> FYI, we noticed an 87.8% improvement of vm-scalability.throughput due to commit:
>>>>>
>>>>>
>>>>> commit: 7fef431be9c9ac255838a9578331567b9dba4477 ("mm/page_alloc: place pages to tail in __free_pages_core()")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>
>>>>>
>>>>> in testcase: vm-scalability
>>>>> on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
>>>>> with following parameters:
>>>>>
>>>>> runtime: 300s
>>>>> size: 512G
>>>>> test: anon-wx-rand-mt
>>>>> cpufreq_governor: performance
>>>>> ucode: 0x5002f01
>>>>>
>>>>> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
>>>>> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>>>>>
>>>>
>>>> I'm curious why we are not able to reproduce this improvement on Skylake
>>>> and actually see a slight performance degradation, at least for
>>>> 300s_128G_truncate_throughput.
>>>>
>>>> Axel Rasmussen <axelrasmussen@xxxxxxxxxx> can provide more details on our
>>>> results.
>>>
>>> Right, our results show a slight regression on a Skylake machine [1],
>>> and a slight performance increase on a Rome machine [2]. For these
>>> tests, I used Linus' v5.9 tag as a baseline, and then applied this
>>> patchset onto that tag as a test kernel (the patches applied cleanly
>>> apart from one comment; I didn't have to do any code fixups). This is
>>> running the same anon-wx-rand-mt test defined in the upstream
>>> lkp-tests job file:
>>> https://github.com/intel/lkp-tests/blob/master/jobs/vm-scalability.yaml
>>
>> Hi,
>>
>> Looking at the yaml, am I right that each test is run after a fresh boot?
>
> Yes-ish. For the results I posted, the larger context would have been
> something like:
>
> - Kernel installed, machine freshly rebooted.
> - Various machine management daemons start by default; some are
> stopped so as not to interfere with the test.
> - Some packages are installed on the machine (in particular, the thing
> which orchestrates the testing).
> - The test is run.
>
> So, the machine is somewhat fresh in the sense that it hasn't been
> e.g. serving production traffic just before running the test, but it's
> also not as clean as it could be. It seems plausible that this
> difference in setup explains the difference in results (I'm not too
> familiar with how the Intel kernel test robot is implemented).

Ah, okay. So most memory in the system is indeed untouched. Thanks!


>
>>
>> As I already replied to David, this patch merely changes the initial
>> order of the freelists. The general end result is that, within a zone,
>> lower memory addresses are allocated before higher memory addresses
>> the first time memory gets allocated. Before, it was the other way
>> around. Once a system has been running for some time, the freelists
>> are randomized.
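
For context, the placement decision in __free_one_page() is roughly the
following - simplified and quoted from memory, so don't take it as the
literal upstream code:

	if (fpi_flags & FPI_TO_TAIL)
		/* e.g., __free_pages_core(): unconditionally place at the tail */
		to_tail = true;
	else if (is_shuffle_order(order))
		/* freelist shuffling picks head/tail pseudo-randomly */
		to_tail = shuffle_pick_tail();
	else
		/* heuristic: prefer the tail if a buddy merge is likely soon */
		to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);

	if (to_tail)
		/* tail: handed out only after everything already queued */
		add_to_free_list_tail(page, zone, order, migratetype);
	else
		/* head: handed out next */
		add_to_free_list(page, zone, order, migratetype);

During early boot the memmap is typically freed in order of increasing
physical address, so with tail placement the pages freed first - the
lowest addresses - stay at the front of the freelists and get allocated
first.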
>>
>> There might be benchmarks/systems for which this initial system state
>> is now better suited - or worse. That doesn't really tell you that
>> core-mm is behaving better/worse now - it merely means that the initial
>> system state under which the benchmark was started affected the result.
>>
>> Looks like so far there is one benchmark+system where it's really
>> beneficial, one where it's slightly beneficial, and one where there is
>> a slight regression.
>>
>>
>> Something like the following would revert to the previous behavior:
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 23f5066bd4a5..fac82420cc3d 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1553,7 +1553,9 @@ void __free_pages_core(struct page *page, unsigned int order)
>>  	 * Bypass PCP and place fresh pages right to the tail, primarily
>>  	 * relevant for memory onlining.
>>  	 */
>> -	__free_pages_ok(page, order, FPI_TO_TAIL);
>> +	__free_pages_ok(page, order,
>> +			system_state < SYSTEM_RUNNING ? FPI_NONE :
>> +			FPI_TO_TAIL);
>>  }
>>
>>  #ifdef CONFIG_NEED_MULTIPLE_NODES
>>
>>
>> (Or better, passing the expected behavior via MEMINIT_EARLY/... to
>> __free_pages_core().)
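
An (untested) sketch of that variant - assuming __free_pages_core() grows
an enum meminit_context parameter that its callers pass in, which it does
not have today:

	/* in __free_pages_core(), with a new (hypothetical) context parameter */
	__free_pages_ok(page, order,
			context == MEMINIT_HOTPLUG ? FPI_TO_TAIL : FPI_NONE);

That would keep the old head placement for everything freed during early
boot and only apply the tail placement where it actually matters, during
memory onlining.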
>>
>>
>> But then, I am not convinced we should make that change: a clean
>> (initial) state might be what these benchmarks run under, but it's far
>> from reality. The change in numbers doesn't show you that core-mm is
>> operating better/worse, just that the baseline for your tests changed
>> due to a changed initial system state.
>
> Not to put words in David's mouth :) but at least from my perspective,
> our original interest was "wow, an 87% improvement! maybe we should
> deploy this patch to production!", and I'm mostly sharing my results
> just to say "it actually doesn't seem to be a huge *general*
> improvement", rather than to advocate for further changes / fixes.

Ah, yes, I saw the +87% and thought "that can't be right".

> IIUC the original motivation for this patch was to fix somewhat of an
> edge case, not to make a very general improvement, so this seems fine.
>

Exactly.