Re: [PATCH v2] x86/crc32: use builtins to improve code generation
From: David Laight
Date: Tue Mar 04 2025 - 15:53:08 EST
On Tue, 4 Mar 2025 04:32:23 +0000
David Laight <david.laight.linux@xxxxxxxxx> wrote:
....
> > For reference, GCC does much better with code gen, but only with the builtin:
> >
> > .L39:
> > crc32q (%rax), %rbx # MEM[(long unsigned int *)p_40], tmp120
> > addq $8, %rax #, p
> > cmpq %rcx, %rax # _37, p
> > jne .L39 #,
>
> That looks reasonable, if Clang's 8 unrolled crc32q is faster per byte
> then you either need to unroll once (no point doing any more) or use
> the loop that does negative offsets from the end.
Thinking about it while properly awake: the 1% difference isn't going to be
the difference between the above and Clang's unrolled loop.
Clang's loop will do 8 bytes every three clocks (the crc32q result chain
has a 3-clock latency, so the dependent chain sets the pace however far the
loop is unrolled). If the above is slower it'll be doing 8 bytes in 4 clocks
(ok, you can get 3.5, but that's unlikely), which would be either 25% or 33%
depending on which way you measure it.
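For reference, the "negative offsets from the end" shape is something like
the sketch below (my illustration, not the patch's code), using the SSE4.2
intrinsic for crc32q; it assumes len is a non-zero multiple of 8 and needs
-msse4.2:

#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h>	/* _mm_crc32_u64(), i.e. crc32q */

/* CRC-32C over len bytes, indexing backwards from the end of the
 * buffer: the loop then needs only one 'add' whose flags feed the
 * branch, with no separate 'cmp' as in the GCC code above. */
static uint64_t crc32c_loop(uint64_t crc, const unsigned char *p, size_t len)
{
	const unsigned char *end = p + len;
	intptr_t i = -(intptr_t)len;

	do {
		crc = _mm_crc32_u64(crc, *(const uint64_t *)(end + i));
		i += 8;
	} while (i != 0);
	return crc;
}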
...
> I'll find the code loop I use - machine isn't powered on at the moment.
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

static int pmc_id;

static void init_pmc(void)
{
	static struct perf_event_attr perf_attr = {
		.type = PERF_TYPE_HARDWARE,
		.config = PERF_COUNT_HW_CPU_CYCLES,
		.pinned = 1,
	};
	struct perf_event_mmap_page *pc;
	int perf_fd;

	perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
	if (perf_fd < 0) {
		fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
		exit(1);
	}
	pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
	if (pc == MAP_FAILED) {
		fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
		exit(1);
	}
	/* pc->index is the rdpmc counter number + 1 (0 => rdpmc not allowed). */
	pmc_id = pc->index - 1;
}
static inline unsigned int rdpmc(unsigned int id)
{
	unsigned int low, high;

	// You need something to force the instruction pipeline to finish.
	// lfence might be enough.
#ifndef NOFENCE
	asm volatile("mfence");
#endif
	asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (id));
#ifndef NOFENCE
	asm volatile("mfence");
#endif
	// Return the low bits; the counter might be 32 or 40 bits wide.
	return low;
}
The test code is then something like:

#define PASSES 10
	unsigned int ticks[PASSES];
	unsigned int tick;
	unsigned int i;

	for (i = 0; i < PASSES; i++) {
		tick = rdpmc(pmc_id);
		test_fn(buf, len);
		ticks[i] = rdpmc(pmc_id) - tick;
	}
	for (i = 0; i < PASSES; i++)
		printf(" %5u", ticks[i]);
Make sure the data is in the L1 cache (or that it dominates).
The values output for passes 2-10 are likely to be the same to within
a clock or two.
I probably tried to subtract an offset for an empty test_fn().
But you can easily work out the 'clocks per loop iteration'
(which is what you are trying to measure) by measuring two separate
loop lengths.
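That is, something along these lines (a sketch: ticks_a and ticks_b are the
stable per-pass values for buffer lengths len_a and len_b, and bytes_per_iter
is how many bytes each loop iteration handles):

/* The fixed fence/rdpmc/call overhead cancels in the subtraction. */
static double clocks_per_iter(unsigned int ticks_a, unsigned long len_a,
			      unsigned int ticks_b, unsigned long len_b,
			      unsigned int bytes_per_iter)
{
	return (double)(ticks_b - ticks_a) * bytes_per_iter /
	       (double)(len_b - len_a);
}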
I did find that running the program sometimes gave slow results,
but it is usually very consistent.
It needs to be run as root.
Clearly a hardware interrupt would generate a very big number,
but in practice they don't happen.
The copy I found was used for measuring IP checksum algorithms.
Seems to output:
$ sudo ./ipcsum
               0    0 160 160 160 160 160 160 160 160 160 160 overhead
3637b4f0b942c3c4 682f 316  25  26  26  26  26  26  26  26  26 csum_partial
3637b4f0b942c3c4 682f 124  79  43  25  25  25  24  26  25  24 csum_partial_1
3637b4f0b942c3c4 682f 166  43  25  25  24  24  24  24  24  24 csum_new adc pair
3637b4f0b942c3c4 682f 115  21  21  21  21  21  21  21  21  21 adc_dec_2
3637b4f0b942c3c4 682f  97  34  31  23  24  24  24  24  24  23 adc_dec_4
3637b4f0b942c3c4 682f  39  33  34  21  21  21  21  21  21  21 adc_dec_8
3637b4f0b942c3c4 682f  81  52  49  52  49  26  25  27  25  26 adc_jcxz_2
3637b4f0b942c3c4 682f  62  46  24  24  24  24  24  24  24  24 adc_jcxz_4
3637b4f0b942c3c4 682f 224  40  21  21  23  23  23  23  23  23 adc_2_pair
3637b4f0b942c3c4 682f  42  36  37  22  22  22  22  22  22  22 adc_4_pair_old
3637b4f0b942c3c4 682f  42  37  34  41  23  23  23  23  23  23 adc_4_pair
3637b4f0b942c3c4 682f 122  19  20  19  18  19  18  19  18  19 adcx_adox
       bef7a78a9 682f 104  51  30  30  30  30  30  30  30  30 add_c_16
       bef7a78a9 682f 143  50  50  27  27  27  27  27  27  27 add_c_32
       6ef7a78ae 682f 103  91  45  34  34  34  35  34  34  34 add_c_high
I don't think the current one is in there - IIRC it is as fast as the adcx_adox one
but more portable.
David