Re: [RFC v2 05/83] Add NOVA filesystem definitions and useful helper routines.

From: Andiry Xu
Date: Mon Mar 19 2018 - 15:40:11 EST


On Sun, Mar 11, 2018 at 12:22 PM, Eric Biggers <ebiggers3@xxxxxxxxx> wrote:
> On Sun, Mar 11, 2018 at 02:00:13PM +0200, Nikolay Borisov wrote:
>> [Adding Herbert Xu to CC since he is the maintainer of the crypto subsys
>> maintainer]
>>
>> On 10.03.2018 20:17, Andiry Xu wrote:
>> <snip>
>>
>> > +static inline u32 nova_crc32c(u32 crc, const u8 *data, size_t len)
>> > +{
>> > + u8 *ptr = (u8 *) data;
>> > + u64 acc = crc; /* accumulator, crc32c value in lower 32b */
>> > + u32 csum;
>> > +
>> > + /* x86 instruction crc32 is part of SSE-4.2 */
>> > + if (static_cpu_has(X86_FEATURE_XMM4_2)) {
>> > + /* This inline assembly implementation should be equivalent
>> > + * to the kernel's crc32c_intel_le_hw() function used by
>> > + * crc32c(), but this performs better on test machines.
>> > + */
>> > + while (len > 8) {
>> > + asm volatile(/* 64b quad words */
>> > + "crc32q (%1), %0"
>> > + : "=r" (acc)
>> > + : "r" (ptr), "0" (acc)
>> > + );
>> > + ptr += 8;
>> > + len -= 8;
>> > + }
>> > +
>> > + while (len > 0) {
>> > + asm volatile(/* trailing bytes */
>> > + "crc32b (%1), %0"
>> > + : "=r" (acc)
>> > + : "r" (ptr), "0" (acc)
>> > + );
>> > + ptr++;
>> > + len--;
>> > + }
>> > +
>> > + csum = (u32) acc;
>> > + } else {
>> > + /* The kernel's crc32c() function should also detect and use the
>> > + * crc32 instruction of SSE-4.2. But calling in to this function
>> > + * is about 3x to 5x slower than the inline assembly version on
>> > + * some test machines.
>>
>> That is really odd. Did you try to characterize why this is the case? Is
>> it purely the overhead of dispatching to the correct backend function?
>> That's a rather big performance hit.
>>
>> > + */
>> > + csum = crc32c(crc, data, len);
>> > + }
>> > +
>> > + return csum;
>> > +}
>> > +
>
> Are you sure that CONFIG_CRYPTO_CRC32C_INTEL was enabled during your tests and
> that the accelerated version was being called? Or, perhaps CRC32C_PCL_BREAKEVEN
> (defined in arch/x86/crypto/crc32c-intel_glue.c) needs to be adjusted. Please
> don't hack around performance problems like this; if they exist, they need to be
> fixed for everyone.
>

I have performed the crc32c test on a Xeon X5647 at 2.93GHz, 14G DDR3
memory at 1066MHz platform.
You are right that enabling CONFIG_CRYPTO_CRC32C_INTEL improves the
performance significantly. nova_crc32c() is still slightly faster than
crc32c() with the flag enabled.

Result numbers are follows: data size in bytes, latency in ns, column
3 is crc32c() with CONFIG_CRYPTO_CRC32C_INTEL enabled and column 4
disabled.

data size (bytes) nova_crc32c() crc32c() -enabled
crc32c() -disabled
64 19 21
56
128 28 29
99
256 46 43
182
512 82 149
354
1024 157 232
728
2048 305 415
1440
4096 603 725
2869

Thanks,
Andiry