CONFIG_ARCH_SUPPORTS_INT128: Why not mips, s390, powerpc, and alpha?

From: George Spelvin (lkml@xxxxxxx)
Date: Fri Mar 29 2019 - 09:07:27 EST

(Cross-posted in case there are generic issues; please trim if
discussion wanders into single-architecture details.)

I was working on some scaling code that can benefit from 64x64->128-bit
multiplies. GCC supports an __int128 type on processors with hardware
support (including z/Arch and MIPS64), but the support was broken on
early compilers, so it's gated behind CONFIG_ARCH_SUPPORTS_INT128.

Currently, of the ten 64-bit architectures Linux supports, that's
only enabled on x86, ARM, and RISC-V.

SPARC and HP-PA don't have support.

But that leaves Alpha, MIPS, PowerPC, and s390x.

Current mips64, powerpc64, and s390x gcc seems to generate sensible code
for mul_u64_u64_shr() in <linux/math64.h> if I cross-compile them.

I don't have easy access to an Alpha cross-compiler to test, but
as it has UMULH, I suspect it would work, too.

Is there a reason it hasn't been enabled on these platforms?

There might be a MIPS64r6 issue, since r6 changed from DMULTU
writing the lo and hi registers to DMULU/DMUHU, and gcc 8.3, at
least, doesn't know how to generate inline code for the latter.

(Note that users *also* check __SIZEOF_INT128__, which is defined if GCC
claims to support __int128, so you don't have to worry about 32-bit
compiles or ancient compilers. The config option only has to be
conditional on *broken* support.)
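To make the double check concrete, here is a minimal userspace sketch of such a guard. mul_hi_u64() and mul_hi_u64_selftest() are hypothetical names used only for illustration, and the fallback is one common way to build the high half out of 32-bit pieces, not necessarily what any given architecture does. Outside a kernel build, CONFIG_ARCH_SUPPORTS_INT128 has to be defined by hand (e.g. -DCONFIG_ARCH_SUPPORTS_INT128) to select the __int128 path.

```c
#include <assert.h>
#include <stdint.h>

/*
 * High 64 bits of a 64x64-bit product.  mul_hi_u64() is a hypothetical
 * helper name; CONFIG_ARCH_SUPPORTS_INT128 normally comes from Kconfig.
 */
#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
static inline uint64_t mul_hi_u64(uint64_t a, uint64_t b)
{
	/* The compiler can use the hardware widening multiply here. */
	return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
#else
/* Portable fallback: four 32x32->64 partial products. */
static inline uint64_t mul_hi_u64(uint64_t a, uint64_t b)
{
	uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
	uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
	uint64_t ll = a_lo * b_lo;
	uint64_t lh = a_lo * b_hi;
	uint64_t hl = a_hi * b_lo;
	uint64_t hh = a_hi * b_hi;
	/* Sum of three values below 2^32, so this cannot overflow. */
	uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

	return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
#endif

/* Quick self-check; passes with whichever path was selected. */
void mul_hi_u64_selftest(void)
{
	assert(mul_hi_u64(UINT64_MAX, UINT64_MAX) == UINT64_MAX - 1);
	assert(mul_hi_u64(UINT64_C(1) << 32, UINT64_C(1) << 32) == 1);
}
```

Both paths should return the same value for every input; only the generated code differs.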

FWIW, the code I'm working on has this inner loop:
( for details)

	u64 get_random_u64(void);

	u64 get_random_max64(u64 range, u64 lim)
	{
		unsigned __int128 prod;

		do {
			prod = (unsigned __int128)get_random_u64() * range;
		} while (unlikely((u64)prod < lim));
		return prod >> 64;
	}

Which turns into these inner loops:
mips64:
	jal	get_random_u64
	dmultu	$2,$17
	mflo	$3
	sltu	$4,$3,$16
	bne	$4,$0,.L7
	mfhi	$2

powerpc64:
	bl	get_random_u64
	mulld	9,3,31
	mulhdu	3,3,31
	cmpld	7,30,9
	bgt	7,.L9

s390x:
	brasl	%r14,get_random_u64@PLT
	lgr	%r5,%r2
	mlgr	%r4,%r10
	lgr	%r2,%r4
	clgr	%r11,%r5
	jh	.L13

I like that the MIPS code leaves the high half of the product in
the hi register until it tests the low half; I wish PowerPC would
similarly move the mulhdu *after* the loop, like the following
hypothetical MIPS R6 code:

	balc	get_random_u64
	dmulu	$3, $2, $17
	sltu	$3, $3, $16
	bnezc	$3, .L7
	dmuhu	$2, $2, $17

Or this handwritten Alpha code:
1:	bsr	$26, get_random_u64
	mulq	$0, $9, $1	# $9 is range
	cmpult	$1, $10, $1	# $10 is lim
	bne	$1, 1b
	umulh	$0, $9, $0
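
For anyone who wants to try the get_random_max64() loop with a cross-compiler outside the kernel tree, a minimal userspace sketch follows. Several things here are assumptions rather than part of the code above: the generator is a splitmix64 stand-in for the kernel's get_random_u64(), unlikely() is dropped, and random_below() is a hypothetical wrapper that assumes lim is meant to be 2^64 mod range (the usual choice for multiply-and-reject sampling, which makes the result uniform in [0, range)).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in generator (splitmix64) so the sketch runs in userspace.
 * ASSUMPTION: the kernel's get_random_u64() is a different implementation.
 */
static uint64_t rng_state = 1;

static uint64_t get_random_u64(void)
{
	uint64_t z = (rng_state += UINT64_C(0x9e3779b97f4a7c15));

	z = (z ^ (z >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
	z = (z ^ (z >> 27)) * UINT64_C(0x94d049bb133111eb);
	return z ^ (z >> 31);
}

/* Same loop as in the message, minus the kernel-only unlikely(). */
static uint64_t get_random_max64(uint64_t range, uint64_t lim)
{
	unsigned __int128 prod;

	do {
		prod = (unsigned __int128)get_random_u64() * range;
	} while ((uint64_t)prod < lim);	/* reject low halves below lim */
	return (uint64_t)(prod >> 64);
}

/*
 * Hypothetical wrapper.  ASSUMPTION: lim = 2^64 mod range, computed
 * as -range % range in unsigned 64-bit arithmetic.
 */
static uint64_t random_below(uint64_t range)
{
	assert(range != 0);
	return get_random_max64(range, -range % range);
}
```

Calling random_below(6), for example, returns a value from 0 to 5; the rejection loop rarely iterates more than once unless range is close to 2^64.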