RE: [PATCH] add slice by 8 algorithm to crc32.c

From: Bob Pearson
Date: Mon Aug 08 2011 - 12:51:09 EST




> -----Original Message-----
> From: George Spelvin [mailto:linux@xxxxxxxxxxx]
> Sent: Monday, August 08, 2011 4:28 AM
> To: fzago@xxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Cc: akpm@xxxxxxxxxxxxxxxxxxxx; joakim.tjernlund@xxxxxxxxxxxx;
> linux@xxxxxxxxxxx; rpearson@xxxxxxxxxxxxxxxxxxxxx
> Subject: [PATCH] add slice by 8 algorithm to crc32.c
>
> Sorry I didn't see this when first posted.
>
> The "slice by 8" terminology is pretty confusing. How about
> "Extended Joakim Tjernlund's optimization from commit
> 836e2af92503f1642dbc3c3281ec68ec1dd39d2e to 8-way parallelism."

Here is a link to the article where I first read about this algorithm. It
describes both the 4- and 8-byte versions.
I do not know who has priority between Joakim and the folks at Intel, but
Intel is usually credited with the idea in the other articles I have seen.
The algorithm currently in crc32.c is clearly the same as the one described
in the article. As you can see, I mis-copied the name from "slicing-by-8"
to "slice by 8".

http://www.intel.com/technology/comms/perfnet/download/CRC_generators.pdf
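
For reference, the core loop of the 8-byte version in that paper looks
roughly like this (my sketch of the little-endian case; the t0..t7 table
names are mine, each a 256-entry u32 table, and buf is assumed to be
word-aligned here):

	while (len >= 8) {
		u32 lo = *(u32 const *)buf ^ crc;	/* fold running CRC into first word */
		u32 hi = *(u32 const *)(buf + 4);
		crc = t7[lo & 255] ^ t6[(lo >> 8) & 255] ^
		      t5[(lo >> 16) & 255] ^ t4[lo >> 24] ^
		      t3[hi & 255] ^ t2[(hi >> 8) & 255] ^
		      t1[(hi >> 16) & 255] ^ t0[hi >> 24];
		buf += 8;
		len -= 8;
	}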

>
> Which is essentially what you're doing. The renaming of tab[0] to t0_le
> and t0_be, and the removal of the DO_CRC4 macro, just increase the diff size.
>
> If you're looking at speeding up the CRC through larger tables, have
> you tried using 10+11+11-bit tables? That would require 20K of tables
> rather than 8K, but would reduce the number of table lookups per byte.
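
If I follow, the per-word step would become three lookups instead of four,
something like this (little-endian shape; t10, t11a and t11b are
hypothetical tables of 1024, 2048 and 2048 u32 entries, each generated for
its bit position, 20K in total; p32 is the word pointer from the patch):

	u32 q = *p32++ ^ crc;		/* fold running CRC into next word */
	crc = t10[q & 0x3ff] ^		/* low 10 bits */
	      t11a[(q >> 10) & 0x7ff] ^	/* middle 11 bits */
	      t11b[q >> 21];		/* high 11 bits */

That is three lookups per four bytes instead of four, though the larger
tables may cost more in cache misses.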
>
>
> One more stunt you could try to increase parallelism: rather than maintain
> the CRC in one register, maintain it in several, and only XOR and collapse
> them at the end.
>
> Start with your 64-bit code, but imagine that the second code block's
> "q = *p32++" always loads 0, and therefore the whole block can be skipped.
> (Since tab[0] = 0 for all CRC tables.)
>
> This computes the CRC of the even words. Then do a second one in parallel
> for the odd words into a separate CRC register. Then combine them at the
> end.
> (Shift one up by 32 bits and XOR into the other.)
>
> This would let you get away with 5K of tables: t4 through t7, and t0.
> t1 through t3 could be skipped.
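
If I understand the idea, the inner loop would look roughly like this
(little-endian sketch using the t4..t7 names above; I have not worked
through the final combine carefully):

	u32 crc0 = crc, crc1 = 0;
	while (words >= 2) {
		u32 q0 = *p32++ ^ crc0;	/* even word */
		u32 q1 = *p32++ ^ crc1;	/* odd word */
		crc0 = t7[q0 & 255] ^ t6[(q0 >> 8) & 255] ^
		       t5[(q0 >> 16) & 255] ^ t4[q0 >> 24];
		crc1 = t7[q1 & 255] ^ t6[(q1 >> 8) & 255] ^
		       t5[(q1 >> 16) & 255] ^ t4[q1 >> 24];
		words -= 2;
	}
	/* combine: shift one register up by 32 bits and fold it into
	 * the other, as you describe; t0 handles any tail bytes */

The two updates do not depend on each other, which is where the extra
instruction-level parallelism would come from.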
>
>
> Ideally, I'd write all this code myself, but I'm a bit crunched at work
> right now so wouldn't be able to get to it for a few days.
>
>
>
> Another possible simplification to the startup code. There's no need
> to compute init_bytes explicitly; just loop until the pointer is aligned:
>
> while ((unsigned)buf & 3) {
> 	if (!len--)
> 		goto done;
> #ifdef __LITTLE_ENDIAN
> 	i0 = *buf++ ^ crc;
> 	crc = t0_le[i0] ^ (crc >> 8);
> #else
> 	i0 = *buf++ ^ (crc >> 24);
> 	crc = t0_be[i0] ^ (crc << 8);
> #endif
> }
> p32 = (u32 const *)buf;
> words = len >> 2;
> end_bytes = len & 3;
>
>
> ... although I'd prefer to keep the DO_CRC() and DO_CRC4 macros, and
> extend them to the 64-bit case, to avoid the nested #ifdefs. That would
> make:
>
> while ((unsigned)buf & 3) {
> 	if (!len--)
> 		goto done;
> 	DO_CRC(*buf++);
> }
> p32 = (u32 const *)buf;
> words = len >> 2;
> end_bytes = len & 3;
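
To make that concrete, I take it DO_CRC would become something like this
against the renamed tables in my patch (sketch only):

	#ifdef __LITTLE_ENDIAN
	# define DO_CRC(x) (crc = t0_le[(crc ^ (x)) & 255] ^ (crc >> 8))
	#else
	# define DO_CRC(x) (crc = t0_be[((crc >> 24) ^ (x)) & 255] ^ (crc << 8))
	#endif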

As you can probably tell, though, I personally don't like macros unless
they are used very frequently. The ifdefs were somewhat reduced in the
second version of the patch.

--