Performance Testing
===================
I've run some limited performance benchmarks:
First, a real-world benchmark that causes a lot of page table manipulation (and
therefore where we would expect to see a regression, if we are going to see one
anywhere): kernel compilation. It barely registers a change. Values are times,
so smaller is better. All relative to base-4k:
| config      | kern mean | kern stdev | user mean | user stdev | real mean | real stdev |
|-------------|-----------|------------|-----------|------------|-----------|------------|
| base-4k     |      0.0% |       1.1% |      0.0% |       0.3% |      0.0% |       0.3% |
| compile-4k  |     -0.2% |       1.1% |     -0.2% |       0.3% |     -0.1% |       0.3% |
| boot-4k     |      0.1% |       1.0% |     -0.3% |       0.2% |     -0.2% |       0.2% |
The Speedometer JavaScript benchmark also shows no significant change. Values
are runs per minute, so bigger is better. All relative to base-4k:
| config | mean | stdev |
|-------------|---------|---------|
| base-4k | 0.0% | 0.8% |
| compile-4k | 0.4% | 0.8% |
| boot-4k | 0.0% | 0.9% |
Finally, I've run some microbenchmarks known to stress page table manipulations
(originally from David Hildenbrand). The fork test maps/allocs 1G of anon
memory, then measures the cost of fork(). The munmap test maps/allocs 1G of anon
memory, then measures the cost of munmap()ing it.
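As a rough illustration of what those tests exercise, here is a minimal sketch
of the measurement pattern, not the actual benchmark code; the 1G size, the
now_sec() helper and the single-iteration timing are assumptions made purely
for the example:

```c
/*
 * Minimal sketch of the fork/munmap measurement pattern described above.
 * Maps and faults in 1G of anonymous memory, then times fork() (child
 * exits immediately) and munmap().
 */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define SIZE (1UL << 30)	/* 1G of anon memory */

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	char *mem;
	double t;
	pid_t pid;

	mem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(mem, 1, SIZE);	/* fault in all pages */

	t = now_sec();
	pid = fork();		/* cost dominated by copying page tables */
	if (pid == 0)
		_exit(0);
	printf("fork:   %f s\n", now_sec() - t);
	waitpid(pid, NULL, 0);

	t = now_sec();
	munmap(mem, SIZE);	/* cost dominated by tearing down page tables */
	printf("munmap: %f s\n", now_sec() - t);

	return 0;
}
```

The real benchmarks repeat the measurement over many iterations to produce the
mean/stdev figures reported below; this sketch times a single iteration only.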
The fork test is known to be extremely sensitive to any changes that cause
instructions to be aligned differently in cachelines. When using this test for
other changes, I've seen double-digit regressions for the slightest thing, so a
12% regression on this test is actually fairly good. This likely represents the
extreme worst case for regressions that will be observed across other
microbenchmarks (famous last words). Values are times, so smaller is better.
All relative to base-4k: