Re: [PATCH v2 1/2] ext4: avoid RWM atomic in EXT4_MB_GRP_TEST_AND_SET_READ

From: Jan Kara

Date: Sat May 30 2026 - 05:53:28 EST

On Wed 27-05-26 13:46:43, Andreas Dilger wrote:
> On May 27, 2026, at 03:03, Bohdan Trach <bohdan.trach@xxxxxxxxxxxxxxx> wrote:
> >
> > EXT4_MB_GRP_TEST_AND_SET_READ uses test_and_set_bit function which
> > issues an atomic write. This can cause high overhead due to cache
> > contention when multiple threads iterate over groups in a tight loop,
> > as is the case for ext4_mb_prefetch(). We have seen this to be a
> > problem for Kunpeng 920b CPUs which uses a single ARM LSE instruction
> > for this purpose.
> >
> > Avoid this unconditional atomic write by testing the bit first without
> > changing its value. This is OK for this use case as this bit is never
> > unset.
> >
> > This change significantly reduces costs of fallocate() operations which
> > trigger linear group scans on large multicore machines where
> > test_and_set_bit issues an atomic write operation unconditionally.
> >
> > Signed-off-by: Bohdan Trach <bohdan.trach@xxxxxxxxxxxxxxx>
>
> Thanks for the patch. Definitely the benchmarks in the 0/2 email show
> significant gains for the Kunpeng system, and reducing contention makes sense
> as core counts increase and the likely case is that the bit is already set.
>
> That said, I wonder if this should (also/instead) be put into test_and_set_bit()
> itself, or add test_and_unlikely_set_bit() or test_and_rarely_set_bit()
> (or similar) optimized for the case where the bit is likely to already be set.

Well, it is a common enough pattern in the kernel that it might be worth it
but I'd say that's a separate effort spanning much more areas.

> I see in your benchmarking that there is not "apples-to-apples"
> comparisons for ARM(Kunpeng) vs. AMD on the same storage. The storage
> hardware and space usage is different for each test run, and the ARM
> numbers show only marginal gains and more negative than positive results
> at all thread counts:
>
> > Benchmark on an existing file system for AMD 9654 (15T FS, 6% space
> > used), kernel 7.1-rc3. This shows the performance impact on a mostly
> > free file system.
> > | thr. | base | patched | improv. |
> > | | perf | perf | |
> > |------|-------|---------|------------|
> > | 1 | 30901 | 31191 | +0.9384810 |
> > | 2 | 50874 | 50504 | -0.7272870 |
> > | 4 | 66068 | 64108 | -2.9666404 |
> > | 8 | 63963 | 61927 | -3.1830902 |
> > | 16 | 47809 | 47044 | -1.6001171 |
> > | 32 | 42441 | 42326 | -0.2709644 |
> > | 64 | 39773 | 39929 | +0.3922259 |
> > | 128 | 37065 | 36413 | -1.7590719 |
>
> The performance reduction might be caused by the now double memory access on
> AMD that is only adding overhead on that CPU implementation? It would be useful
> to see the testing on Kunpeng vs. AMD/Intel on the same storage device/usage.
>
> That would tell us if it is more appropriate to optimize this in the aarch64
> test_and_set_bit() rather than in ext4.

Well, even for x86 test_bit() || test_and_set_bit() is significantly
cheaper when the cacheline is hot because test_and_set_bit()
unconditionally invalidates the cacheline for all the other CPUs. And the
cost of additional test_bit() for something that is already in L1 cache is
very small (a few cycles) so I think the above differences are either pure
benchmarking noise or due to changes in code alignment or similar.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR