Re: Potential Regression in futex Performance from v6.9 to v6.10-rc1 and v6.11-rc4

From: David Hildenbrand
Date: Wed Sep 04 2024 - 09:47:58 EST


On 04.09.24 12:05, Anders Roxell wrote:
On Tue, 3 Sept 2024 at 14:37, David Hildenbrand <david@xxxxxxxxxx> wrote:

On 03.09.24 14:21, Anders Roxell wrote:
Hi,

I've noticed that the futex01-thread-* tests in will-it-scale-sys-threads
are running about 2% slower on v6.10-rc1 compared to v6.9, and this
slowdown continues with v6.11-rc4. I am focused on identifying any
performance regressions greater than 2% that occur in automated
testing on arm64 HW.

Using git bisect, I traced the issue to commit
f002882ca369 ("mm: merge folio_is_secretmem() and
folio_fast_pin_allowed() into gup_fast_folio_allowed()").

Thanks for analyzing the (slight) regression!


My tests were performed on m7g.large and m7g.metal instances:

* The slowdown is consistent regardless of the number of threads;
futex1-threads-128 performs similarly to futex1-threads-2, indicating
there is no scalability issue, just a minor performance overhead.
* The test doesn’t involve actual futex operations, just dummy wake/wait
on a variable that isn’t accessed by other threads, so the results might
not be very significant.
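
For illustration only, a loop of the kind described above could look like the sketch below. It is not the will-it-scale source; the function names and the choice of a non-private FUTEX_WAKE are assumptions. As far as I understand, a non-private futex has to resolve the backing page to build its key, which is why a change in the GUP-fast path can show up in such a benchmark at all.

```c
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: wake at most one waiter on a word nobody waits on. */
long futex_wake_nobody(uint32_t *uaddr)
{
	/* Non-private FUTEX_WAKE (assumption); returns the number of woken waiters. */
	return syscall(SYS_futex, uaddr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

/* Hypothetical benchmark body: count the calls that woke nobody. */
unsigned long run_dummy_wakes(unsigned long iterations)
{
	uint32_t word = 0;	/* local to this thread, never touched by others */
	unsigned long ok = 0;

	for (unsigned long i = 0; i < iterations; i++) {
		if (futex_wake_nobody(&word) == 0)
			ok++;
	}
	return ok;
}
```

Since no thread ever waits on the word, each call only measures the syscall and key lookup cost, not any wake-up work.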

Given that this seems to be a minor increase in code path length rather
than a scalability issue, would this be considered a genuine regression?

Likely not. I've seen these kinds of regressions (for example in my fork
micro-benchmarks) simply because the compiler slightly changes the code
layout, or suddenly decides not to inline a function.

Still it is rather unexpected, so let's find out what's happening.

My first intuition would have been that the compiler now decides to not
inline gup_fast_folio_allowed() anymore, adding a function call.

LLVM seems to inline it for me; GCC does not.

Would this return the original behavior for you?

David, thank you for the quick patch to try.

This patch fixed the original regression on v6.10-rc1, but on current mainline
v6.11-rc6 the patch makes no difference and the performance is as expected.

Just so I understand this correctly:

It fixed itself after v6.11-rc4, but v6.11-rc4 was fixed with my patch?

If that's the case, then it's really the compiler deciding whether to inline or not, and on v6.11-rc6 it decides to inline again.
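
To illustrate the inlining point, here is a minimal userspace sketch. It is not the patch discussed above, which is not reproduced in this message; all names in it are made up. It contrasts a helper whose inlining is left to the compiler with one that is forced inline via the GCC/Clang always_inline attribute, which is what the kernel's __always_inline expands to.

```c
#include <stdbool.h>

/* Local stand-in for the kernel's __always_inline (name is an assumption). */
#define my_always_inline inline __attribute__((__always_inline__))

/* The compiler is free to inline this or emit a real function call. */
static bool allowed_plain(unsigned int flags)
{
	return (flags & 1u) == 0;
}

/* Inlined into every caller, regardless of the compiler's heuristics. */
static my_always_inline bool allowed_forced(unsigned int flags)
{
	return (flags & 1u) == 0;
}

/* Hypothetical hot loop that calls both variants of the check. */
int count_allowed(const unsigned int *flags, int n)
{
	int count = 0;

	for (int i = 0; i < n; i++) {
		if (allowed_plain(flags[i]) && allowed_forced(flags[i]))
			count++;
	}
	return count;
}
```

Whether the plain variant ends up as a call can differ between compilers and even between kernel versions with unrelated changes, so checking the generated code (for example with objdump -d) is the reliable way to tell.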

--
Cheers,

David / dhildenb