Re: [PATCH v2] RAS/AMD/ATL: Remove bitwise_xor_bits

From: Nikolay Borisov
Date: Mon Nov 24 2025 - 08:26:24 EST




On 11/24/25 14:52, Borislav Petkov wrote:
> On Mon, Nov 24, 2025 at 08:03:02PM +0800, Kuan-Wei Chiu wrote:
>> On Mon, Nov 24, 2025 at 12:05:26PM +0100, Borislav Petkov wrote:
>>> On Mon, Nov 24, 2025 at 04:57:51PM +0800, Kuan-Wei Chiu wrote:
>>>> Both LLVM/GCC support a __builtin_parity function which is functionally
>>>> equivalent to the custom bitwise_xor_bits() one. Let's simplify the code by
>>>> relying on the built-in. No functional changes.
>>>>
>>>> IIRC in some cases,
>>>
>>> Which are those cases?
>>>
>>> Do you have a trigger scenario?
>>
>> I did a quick search, and I believe it was this kernel test robot
>> report [1] that reminded me of this compiler behavior.
>>
>> [1]: https://lore.kernel.org/oe-kbuild-all/202501312159.l6jNRaYy-lkp@xxxxxxxxx/
> 
> Interesting, thanks for the pointer.
> 
> @Nik, just use hweight16() but pls do check what asm the compiler generates
> before and after so that at least there's some palpable improvement or gcc is
> smart enough to replace the unrolled XORing with something slick.
> 
> Also put in the commit message why we're not using the builtin parity thing
> and quote the link above.
> 
> Thx.



So the custom bitwise_xor_bits() function results in the following asm being inlined, i.e. 4 total copies:

# drivers/ras/amd/atl/umc.c:59: tmp ^= (val >> i) & 0x1;
xorl %ecx, %ecx # ivtmp.157
# drivers/ras/amd/atl/umc.c:55: u16 tmp = 0;
xorl %esi, %esi # tmp
# drivers/ras/amd/atl/umc.c:253: temp = bitwise_xor_bits(col & addr_hash.bank[i].col_xor);
andl %r13d, %edx # col, tmp249
# drivers/ras/amd/atl/umc.c:59: tmp ^= (val >> i) & 0x1;
movzwl %dx, %edx # tmp249, _387
.L3:
movl %edx, %eax # _387, tmp250
sarl %cl, %eax # ivtmp.157, tmp250
# drivers/ras/amd/atl/umc.c:58: for (i = 0; i < 16; i++)
addl $1, %ecx #, ivtmp.157
# drivers/ras/amd/atl/umc.c:59: tmp ^= (val >> i) & 0x1;
andl $1, %eax #, tmp251
# drivers/ras/amd/atl/umc.c:59: tmp ^= (val >> i) & 0x1;
xorl %eax, %esi # tmp251, tmp
# drivers/ras/amd/atl/umc.c:58: for (i = 0; i < 16; i++)
cmpl $16, %ecx #, ivtmp.157
jne .L3 #,


Whilst with hweight16(), if the CPU supports popcnt it comes down to 2 instructions - a popcntl and an andl:

# drivers/ras/amd/atl/umc.c:243: temp ^= hweight16(row & addr_hash.bank[i].row_xor) & 1;
andl %ebp, %edi # _321, tmp221
# ./arch/x86/include/asm/arch_hweight.h:19: asm_inline (ALTERNATIVE("call __sw_hweight32",
#APP
# 19 "./arch/x86/include/asm/arch_hweight.h" 1
# ALT: oldinstr
771:
call __sw_hweight32
772:
# ALT: padding
.skip -(((775f-774f)-(772b-771b)) > 0) * ((775f-774f)-(772b-771b)),0x90
773:
.pushsection .altinstructions,"a"
.long 771b - .
.long 774f - .
.4byte ( 4*32+23)
.byte 773b-771b
.byte 775f-774f
.popsection
.pushsection .altinstr_replacement, "ax"
# ALT: replacement
774:
popcntl %edi, %eax # tmp221, res
775:
.popsection
# 0 "" 2
# drivers/ras/amd/atl/umc.c:243: temp ^= hweight16(row & addr_hash.bank[i].row_xor) & 1;
#NO_APP
xorl %eax, %edx # res, tmp222
andl $1, %edx #, temp


So yeah, much better IMHO, unless there is some hidden latency in the popcnt...