Re: Re: [PATCH v2 1/2] ext4: avoid RWM atomic in EXT4_MB_GRP_TEST_AND_SET_READ

From: Bohdan Trach

Date: Fri May 29 2026 - 11:58:14 EST

On Wed, 27 May 2026 19:46:43, Andreas Dilger wrote:
> Thanks for the patch. Definitely the benchmarks in the 0/2 email show
> significant gains for the Kunpeng system, and reducing contention makes sense
> as core counts increase and the likely case is that the bit is already set.
>
> That said, I wonder if this should (also/instead) be put into
> test_and_set_bit() itself, or add test_and_unlikely_set_bit() or
> test_and_rarely_set_bit() (or similar) optimized for the case where the bit is
> likely to already be set.

Thank you for the review. I understand this concern; I have looked for other
parts of the kernel code to see if this change can be generalized, in which case
introducing a new test_and_set_unlikely_bit()/test_test_and_set_bit() would have
been a more appropriate approach indeed. Alas, I could not find more of such
cases, and thus I decided to limit the change to the ext4 code. Maybe developers
with more knowledge of their subsystems know other potential use cases for such
functions.

> I see in your benchmarking that there is not "apples-to-apples" comparisons
> for ARM(Kunpeng) vs. AMD on the same storage. The storage hardware and space
> usage is different for each test run, and the ARM numbers show only marginal
> gains and more negative than positive results at all thread counts:
>
> > Benchmark on an existing file system for AMD 9654 (15T FS, 6% space
> > used), kernel 7.1-rc3. This shows the performance impact on a mostly
> > free file system.
> > | thr. | base | patched | improv. |
> > | | perf | perf | |
> > |------|-------|---------|------------|
> > | 1 | 30901 | 31191 | +0.9384810 |
> > | 2 | 50874 | 50504 | -0.7272870 |
> > | 4 | 66068 | 64108 | -2.9666404 |
> > | 8 | 63963 | 61927 | -3.1830902 |
> > | 16 | 47809 | 47044 | -1.6001171 |
> > | 32 | 42441 | 42326 | -0.2709644 |
> > | 64 | 39773 | 39929 | +0.3922259 |
> > | 128 | 37065 | 36413 | -1.7590719 |
>
>
> The performance reduction might be caused by the now double memory access on
> AMD that is only adding overhead on that CPU implementation? It would be useful
> to see the testing on Kunpeng vs. AMD/Intel on the same storage device/usage.

I've looked a bit deeper into these results (which are different from the 133G
FS image results on AMD). The flame graph is dominated by contention between
open and unlink, e.g.:
... -> __x64_sys_openat -> .. -> path_openat -> down_write ->
-> rwsem_down_write_slowpath -> osq_lock
against
... -> __x64_sys_unlink -> filename_unlinkat -> down_write ->
-> rwsem_down_write_slowpath -> osq_lock
and ext4_mb_prefetch does not feature anywhere in the flame graph. So I think
in this FS configuration, different code path got stressed, the numbers should
be ~0-centered noise? I have definitely seen a correlation with the space usage
percentage of the file system -- on a completely free file system this patch
does not help much irrespective of the architecture, though I did not pinpoint
the root cause for triggering the linear group scan to the usage% specifically,
or if other factors can influence it as well (fragmentation, etc.)

OTOH, the storage device should not matter much for this (admittedly very
limited) kind of benchmark, as the main case where this optimization matters is
if there is a long iteration over groups, which makes ext4_mb_prefetch()
CPU-bound, and this does seem to happen more often as the FS becomes more full.
This was actually the intention of the test with 133G FS image (below the "big"
FS results in the email) -- it should fix parameters other than the underlying
storage device (which should not matter for this test).

> That would tell us if it is more appropriate to optimize this in the aarch64
> test_and_set_bit() rather than in ext4.

I do not believe this change can happen generically in aarch64
test_and_set_bit(), as in a situation where the bit is flipped frequently, this
change will most likely hurt the performance. This probably depends more on the
CPU microarchitecture/cache coherence implementation in hardware.

--
With best regards,
Bohdan Trach