Re: [PATCH v3 00/16] Introduce and use generic parity16/32/64 helper
From: Kuan-Wei Chiu
Date: Fri Apr 04 2025 - 04:51:57 EST
On Thu, Apr 03, 2025 at 12:14:04PM -0400, Yury Norov wrote:
> On Thu, Apr 03, 2025 at 10:39:03PM +0800, Kuan-Wei Chiu wrote:
> > On Tue, Mar 25, 2025 at 12:43:25PM -0700, H. Peter Anvin wrote:
> > > On 3/23/25 08:16, Kuan-Wei Chiu wrote:
> > > >
> > > > Interface 3: Multiple Functions
> > > > Description: bool parity_odd8/16/32/64()
> > > > Pros: No need for explicit casting; easy to integrate
> > > > architecture-specific optimizations; except for parity8(), all
> > > > functions are one-liners with no significant code duplication
> > > > Cons: More functions may increase maintenance burden
> > > > Opinions: Only I support this approach
> > > >
> > >
> > > OK, so I responded to this but I can't find my reply or any of the
> > > followups, so let me go again:
> > >
> > > I prefer this option, because:
> > >
> > > a. Virtually all uses of parity is done in contexts where the sizes of the
> > > items for which parity is to be taken are well-defined, but it is *really*
> > > easy for integer promotion to cause a value to be extended to 32 bits
> > > unnecessarily (sign or zero extend, although for parity it doesn't make any
> > > difference -- if the compiler realizes it.)
> > >
> > > b. It makes it easier to add arch-specific implementations, notably using
> > > __builtin_parity on architectures where that is known to generate good code.
> > >
> > > c. For architectures where only *some* parity implementations are
> > > fast/practical, the generic fallbacks will either naturally synthesize them
> > > from components via shift-xor, or they can be defined to use a larger
> > > version; the function prototype acts like a cast.
> > >
> > > d. If there is a reason in the future to add a generic version, it is really
> > > easy to do using the size-specific functions as components; this is
> > > something we do literally all over the place, using a pattern so common that
> > > it, itself, probably should be macroized:
> > >
> > > #define parity(x) \
> > > ({ \
> > > typeof(x) __x = (x); \
> > > bool __y; \
> > > switch (sizeof(__x)) { \
> > > case 1: \
> > > __y = parity8(__x); \
> > > break; \
> > > case 2: \
> > > __y = parity16(__x); \
> > > break; \
> > > case 4: \
> > > __y = parity32(__x); \
> > > break; \
> > > case 8: \
> > > __y = parity64(__x); \
> > > break; \
> > > default: \
> > > BUILD_BUG(); \
> > > break; \
> > > } \
> > > __y; \
> > > })
> > >
> > Thank you for your detailed response and for explaining the rationale
> > behind your preference. The points you outlined in (a)–(d) all seem
> > quite reasonable to me.
> >
> > Yury,
> > do you have any feedback on this?
> > Thank you.
>
> My feedback to you:
>
> I asked you to share any numbers about each approach. Asm listings,
> performance tests, bloat-o-meter. But you did nothing or very little
> in that department. You move this series, and it means you should be
> very well aware of alternative solutions, their pros and cons.
>
It seems the concern is that I didn't provide assembly results and
performance numbers. While I believe that listing these numbers alone
cannot prove which users really care about parity efficiency, I have
included the assembly results and my initial observations below. Some
differences, like mov vs movzh, are likely difficult to measure.
Compilation on x86-64 using GCC 14.2 with O2 Optimization:
Link to Godbolt: https://godbolt.org/z/EsqPMz8cq
For u8 Input:
- #2 and #3 generate exactly the same assembly code, while #1 replaces
one `mov` instruction with `movzh`, which may slightly slow down the
performance due to zero extension.
- Efficiency: #2 = #3 > #1
For u16 Input:
- As with u8 input, #1 performs an unnecessary zero extension, while #3
replaces one of the `shr` instructions in #2 with a `mov`, making it
slightly faster.
- Efficiency: #3 > #2 > #1
For u32 Input:
- #1 has an additional `mov` instruction compared to #2, and #2 has an
extra `shr` instruction compared to #3.
- Efficiency: #3 > #2 > #1
For u64 Input:
- #1 and #2 generate the same code, but #3 has one less `shr`
instruction compared to the others.
- Efficiency: #3 > #1 = #2
---
Adding -m32 Flag to View Assembly for 32-bit Machine:
Link to Godbolt: https://godbolt.org/z/GrPa86Eq5
For u8 Input:
- #2 and #3 generate identical assembly code, whereas #1 has additional
`mov`, `shr`, and `push/pop` instructions.
- Efficiency: #2 = #3 > #1
For u16 Input:
- #1 uses a lot of `xmm` register operations, making it slower than #2
and #3. Additionally, #2 has an extra `shr` instruction compared to #3.
- Efficiency: #3 > #2 > #1
For u32 Input:
- #1 again uses a lot of `xmm` register operations, so it is slower
than #2 and #3, and #2 has an additional `shr` instruction compared to #3.
- Efficiency: #3 > #2 > #1
For u64 Input:
- Both #1 and #2 use `xmm` register operations, but #1 has a few extra
`movdqa` instructions. #3 is more concise, using a few `shr`, `xor`,
and `mov` instructions to complete the operation.
- Efficiency: #3 > #2 > #1
Regards,
Kuan-Wei