Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I

From: Palmer Dabbelt
Date: Wed Jul 22 2020 - 22:13:12 EST

Next message: Can Guo: "Re: [PATCH v4 3/8] ufs: ufs-qcom: Fix a few BUGs in func ufs_qcom_dump_dbg_regs()"
Previous message: Stephen Rothwell: "Re: linux-next: manual merge of the bpf-next tree with the net tree"
In reply to: Chenxi Mao: "Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I"
Next in thread: Chenxi Mao: "Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, 22 Jul 2020 18:59:12 PDT (-0700), maochenxi@xxxxxxxxx wrote:

Hi Palmer and Emil:

As Emil mentioned in previous E-mail loop, I did the same test on my kernel as well.

Sorry, I guess I crossed up my emails. I think it's best to just drop this for
now, as it doesn't actually seem to generate better code for our current
target.

My kernel is based on Linux 5.8-RC6 with GCC-10.1. (ISA C extension enabled)

The disassembly code as below:

CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled:

0000000000000000 <__sw_hweight32>:
ÂÂ 0:ÂÂÂ 555557b7ÂÂÂÂÂÂÂÂÂ ÂÂÂ luiÂÂÂ a5,0x55555
ÂÂ 4:ÂÂÂ 0015571bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a4,a0,0x1
ÂÂ 8:ÂÂÂ 55578793ÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ a5,a5,1365 # 55555555 <.LASF5+0x5555509d>
ÂÂ c:ÂÂÂ 8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a5,a5,a4
ÂÂ e:ÂÂÂ 9d1dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ subwÂÂÂ a0,a0,a5

0000000000000010 <.LVL1>:
Â 10:ÂÂÂ 333337b7ÂÂÂÂÂÂÂÂÂ ÂÂÂ luiÂÂÂ a5,0x33333
Â 14:ÂÂÂ 33378793ÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
Â 18:ÂÂÂ 0025571bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a4,a0,0x2
Â 1c:ÂÂÂ 8d7dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a0,a0,a5
Â 1e:ÂÂÂ 8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a5,a5,a4
Â 20:ÂÂÂ 9fa9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a5,a5,a0
Â 22:ÂÂÂ 0047d51bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a0,a5,0x4
Â 26:ÂÂÂ 9fa9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a5,a5,a0
Â 28:ÂÂÂ 0f0f1537ÂÂÂÂÂÂÂÂÂ ÂÂÂ luiÂÂÂ a0,0xf0f1
Â 2c:ÂÂÂ 1141ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ sp,sp,-16
Â 2e:ÂÂÂ f0f50513ÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ a0,a0,-241 # f0f0f0f <.LASF5+0xf0f0a57>
Â 32:ÂÂÂ e422ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ sdÂÂÂ s0,8(sp)
Â 34:ÂÂÂ 8fe9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a5,a5,a0
Â 36:ÂÂÂ 0800ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ s0,sp,16
Â 38:ÂÂÂ 0087951bÂÂÂÂÂÂÂÂÂ ÂÂÂ slliwÂÂÂ a0,a5,0x8
Â 3c:ÂÂÂ 6422ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ldÂÂÂ s0,8(sp)
Â 3e:ÂÂÂ 9d3dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a0,a0,a5
Â 40:ÂÂÂ 0105179bÂÂÂÂÂÂÂÂÂ ÂÂÂ slliwÂÂÂ a5,a0,0x10
Â 44:ÂÂÂ 9d3dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a0,a0,a5
Â 46:ÂÂÂ 0185551bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a0,a0,0x18
Â 4a:ÂÂÂ 0141ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ sp,sp,16
Â 4c:ÂÂÂ 8082ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ret

CONFIG_ARCH_HAS_FAST_MULTIPLIER disabled:

000000000000004e <__sw_hweight32_default>:
Â 4e:ÂÂÂ 55555737ÂÂÂÂÂÂÂÂÂ ÂÂÂ luiÂÂÂ a4,0x55555
Â 52:ÂÂÂ 0015579bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a5,a0,0x1
Â 56:ÂÂÂ 55570713ÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ a4,a4,1365 # 55555555 <.LASF5+0x5555509d>
Â 5a:ÂÂÂ 8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a5,a5,a4
Â 5c:ÂÂÂ 9d1dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ subwÂÂÂ a0,a0,a5

000000000000005e <.LVL3>:
Â 5e:ÂÂÂ 333337b7ÂÂÂÂÂÂÂÂÂ ÂÂÂ luiÂÂÂ a5,0x33333
Â 62:ÂÂÂ 33378793ÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
Â 66:ÂÂÂ 0025571bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a4,a0,0x2
Â 6a:ÂÂÂ 8d7dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a0,a0,a5
Â 6c:ÂÂÂ 8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a5,a5,a4
Â 6e:ÂÂÂ 9fa9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a5,a5,a0
Â 70:ÂÂÂ 0047d51bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a0,a5,0x4
Â 74:ÂÂÂ 9d3dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a0,a0,a5
Â 76:ÂÂÂ 0f0f17b7ÂÂÂÂÂÂÂÂÂ ÂÂÂ luiÂÂÂ a5,0xf0f1
Â 7a:ÂÂÂ 1141ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ sp,sp,-16
Â 7c:ÂÂÂ f0f78793ÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0a57>
Â 80:ÂÂÂ e422ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ sdÂÂÂ s0,8(sp)
Â 82:ÂÂÂ 8fe9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a5,a5,a0
Â 84:ÂÂÂ 0800ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ s0,sp,16
Â 86:ÂÂÂ 0087d51bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a0,a5,0x8
Â 8a:ÂÂÂ 6422ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ldÂÂÂ s0,8(sp)
Â 8c:ÂÂÂ 9fa9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a5,a5,a0
Â 8e:ÂÂÂ 0107d51bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a0,a5,0x10
Â 92:ÂÂÂ 9d3dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a0,a0,a5
Â 94:ÂÂÂ 0ff57513ÂÂÂÂÂÂÂÂÂ ÂÂÂ andiÂÂÂ a0,a0,255
Â 98:ÂÂÂ 0141ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ sp,sp,16
Â 9a:ÂÂÂ 8082ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ret

This 2 implementations is almost same but small differences.

Especially in CONFIG_ARCH_HAS_FAST_MULTIPLIER condition,Â below code didn't use "mul" instructions.

ÂÂÂ " return (w * 0x01010101) >> 24; "

So I am trying to translate this code with inline assembly as below:

//return (w * 0x01010101) >> 24;
__asm__ (
" mul %0, %0, %1\n"
: "+r" (w)
: "r" (w), "r"(0x01010101)
:);
return w >> 24;

After above change, the disassambly as below:
0000000000000000 <__sw_hweight32>:
ÂÂ 0:ÂÂ Â555557b7ÂÂÂÂÂÂÂÂ ÂÂÂ ÂluiÂÂ Âa5,0x55555
ÂÂ 4:ÂÂ Â0015571bÂÂÂÂÂÂÂÂ ÂÂÂ ÂsrliwÂÂ Âa4,a0,0x1
ÂÂ 8:ÂÂ Â55578793ÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddiÂÂ Âa5,a5,1365 # 55555555 <.LASF5+0x55555119>
ÂÂ c:ÂÂ Â8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂandÂÂ Âa5,a5,a4
ÂÂ e:ÂÂ Â9d1dÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂsubwÂÂ Âa0,a0,a5

0000000000000010 <.LVL1>:
Â 10:ÂÂ Â333337b7ÂÂÂÂÂÂÂÂ ÂÂÂ ÂluiÂÂ Âa5,0x33333
Â 14:ÂÂ Â0025571bÂÂÂÂÂÂÂÂ ÂÂÂ ÂsrliwÂÂ Âa4,a0,0x2
Â 18:ÂÂ Â33378793ÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddiÂÂ Âa5,a5,819 # 33333333 <.LASF5+0x33332ef7>
Â 1c:ÂÂ Â8d7dÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂandÂÂ Âa0,a0,a5
Â 1e:ÂÂ Â8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂandÂÂ Âa5,a5,a4
Â 20:ÂÂ Â9fa9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddwÂÂ Âa5,a5,a0
Â 22:ÂÂ Â0047d71bÂÂÂÂÂÂÂÂ ÂÂÂ ÂsrliwÂÂ Âa4,a5,0x4
Â 26:ÂÂ Â9f3dÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddwÂÂ Âa4,a4,a5
Â 28:ÂÂ Â0f0f17b7ÂÂÂÂÂÂÂÂ ÂÂÂ ÂluiÂÂ Âa5,0xf0f1
Â 2c:ÂÂ Â1141ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddiÂÂ Âsp,sp,-16
Â 2e:ÂÂ Âf0f78793ÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddiÂÂ Âa5,a5,-241 # f0f0f0f <.LASF5+0xf0f0ad3>
Â 32:ÂÂ Âe422ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂsdÂÂ Âs0,8(sp)
Â 34:ÂÂ Â8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂandÂÂ Âa5,a5,a4
Â 36:ÂÂ Â0800ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddiÂÂ Âs0,sp,16
Â 38:ÂÂ Â01010737ÂÂÂÂÂÂÂÂ ÂÂÂ ÂluiÂÂ Âa4,0x1010
Â 3c:ÂÂ Â853eÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂmvÂÂ Âa0,a5

000000000000003e <.LVL2>:
Â 3e:ÂÂ Â1017071bÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddiwÂÂ Âa4,a4,257
Â 42:ÂÂ Â02f50533ÂÂÂÂÂÂÂÂ ÂÂÂ ÂmulÂÂ Âa0,a0,a5
Â 46:ÂÂ Â6422ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂldÂÂ Âs0,8(sp)
Â 48:ÂÂ Â0185551bÂÂÂÂÂÂÂÂ ÂÂÂ ÂsrliwÂÂ Âa0,a0,0x18

"mul" instruction is leveraged as expectation, but 0x01010101 load waste several instructions.

Based on this test, force to leverage "mul" instruction might be not faster than current compiler implementations.

I am not sure above assembly is the best way to load 0x01010101? I checked the ISA manual, "lui" only

load 20 bits per time, is this the best way to load instants?

On the other hand, I try to compare ARM64 disassembly code:

.....

ÂÂ 4:ÂÂÂ 3200c3e2 ÂÂÂ movÂÂÂ w2, #0x1010101ÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ // #16843009

......

ÂÂ w =Â (w + (w >> 4)) & 0x0f0f0f0f;
Â 20:ÂÂÂ 0b401000 ÂÂÂ addÂÂÂ w0, w0, w0, lsr #4
Â 24:ÂÂÂ 1200cc00 ÂÂÂ andÂÂÂ w0, w0, #0xf0f0f0f
ÂÂÂ return (w * 0x01010101) >> 24;
Â 28:ÂÂÂ 1b027c00 ÂÂÂ mulÂÂÂ w0, w0, w2

Only one "mov" instructions to load 0x1010101 and one "mul" instruction for multiply.

Let me summary as below:

1.Â GCC 10.1 cannot generate "mul" instruction when CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled.

2. force to generate "mul" didn't get better because instants load waste instructions.

3. If GCC compiler behavior is best solution for this case, we could have below work around on Riscv.

Âunsigned int __sw_hweight32(unsigned int w)
Â{
-#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
+/*
+ * Risc-V could not generate mul(w) instruction in this case
+ */
+#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && !defined(CONFIG_RISCV)
ÂÂÂÂÂÂÂ w -= (w >> 1) & 0x55555555;
ÂÂÂÂÂÂÂ w =Â (w & 0x33333333) + ((w >> 2) & 0x33333333);
ÂÂÂÂÂÂÂ w =Â (w + (w >> 4)) & 0x0f0f0f0f;

Chenxi

On 2020/7/21 äå9:17, Palmer Dabbelt wrote:

On Wed, 08 Jul 2020 22:19:22 PDT (-0700), maochenxi@xxxxxxxxx wrote:

Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
which works fine on GCC-9.3 and GCC-10.1

PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.

Signed-off-by: Chenxi Mao <maochenxi@xxxxxxxxx>
---
Âarch/riscv/Kconfig | 1 +
Â1 file changed, 1 insertion(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 128192e14ff2..84e6777fecad 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -202,6 +202,7 @@ config ARCH_RV64I
ÂÂÂÂ bool "RV64I"
ÂÂÂÂ select 64BIT
ÂÂÂÂ select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
+ÂÂÂ select ARCH_HAS_FAST_MULTIPLIER
ÂÂÂÂ select HAVE_DYNAMIC_FTRACE if MMU
ÂÂÂÂ select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
ÂÂÂÂ select HAVE_FTRACE_MCOUNT_RECORD

Ah, thanks -- this one didn't show up when I was looking at the last one.Â I
think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
difference there.Â I guess in theory we should be sticking this all in some
sort of "platform type" optimization flags, but that's probably bit much for
now.

Next message: Can Guo: "Re: [PATCH v4 3/8] ufs: ufs-qcom: Fix a few BUGs in func ufs_qcom_dump_dbg_regs()"
Previous message: Stephen Rothwell: "Re: linux-next: manual merge of the bpf-next tree with the net tree"
In reply to: Chenxi Mao: "Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I"
Next in thread: Chenxi Mao: "Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]