Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access

From: Jason A. Donenfeld
Date: Thu Nov 03 2016 - 03:25:12 EST

Hi Herbert,

On Thu, Nov 3, 2016 at 1:49 AM, Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx> wrote:
> FWIW I'd rather live with a 6% slowdown than having two different
> code paths in the generic code. Anyone who cares about 6% would
> be much better off writing an assembly version of the code.

Please think twice before deciding that the generic C "is allowed to
be slow". It turns out to be used far more often than might be
obvious. For example, crypto is commonly done at the netdev layer --
as is the case with mac80211-based drivers. At this layer, the FPU on
x86 isn't always available, depending on the path used. Some
combinations of drivers, packet family, and workload can result in the
generic C being used instead of the vectorized assembly for a large
share of the time. So, I think we do have a good motivation for
wanting the generic C to be as fast as possible.

In the particular case of poly1305, these are the only spots where
unaligned accesses take place, and they're rather small, and I think
it's pretty obvious what's happening in the two different cases of
code from a quick glance. This isn't the "two different code paths"
case in which there's a significant future-facing maintenance burden.
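
To make the size of the difference concrete, here is a rough userspace
sketch of the two ways a 16-byte poly1305 block can be read into 26-bit
limbs. This is not the patch itself: the function names and the
function-pointer indirection are made up for illustration, and the
final limb leaves out the high bit that the real code ORs in. The limb
offsets and shifts follow the usual 32-bit "donna"-style layout.

```c
#include <stdint.h>
#include <string.h>

/* Path A: build a little-endian 32-bit value one byte at a time.
 * Works for any alignment and any host byte order. */
static uint32_t load_le32_bytewise(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Path B: one word-sized load. memcpy() sidesteps the undefined
 * behaviour of dereferencing a misaligned pointer; compilers typically
 * turn it into a single load on CPUs that permit unaligned access.
 * If the compiler does not define __BYTE_ORDER__, a little-endian
 * host is assumed. */
static uint32_t load_le32_wordwise(const uint8_t *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	v = __builtin_bswap32(v);
#endif
	return v;
}

/* Hypothetical helper type, only here so both paths share one body. */
typedef uint32_t (*load_le32_fn)(const uint8_t *);

/* Split a 16-byte block into five 26-bit limbs. Loads at offsets 3, 6
 * and 9 are misaligned by construction, which is why the choice of
 * load routine matters at all. */
static void block_to_limbs(const uint8_t m[16], uint32_t limbs[5],
			   load_le32_fn load)
{
	limbs[0] = load(m + 0) & 0x3ffffff;
	limbs[1] = (load(m + 3) >> 2) & 0x3ffffff;
	limbs[2] = (load(m + 6) >> 4) & 0x3ffffff;
	limbs[3] = (load(m + 9) >> 6) & 0x3ffffff;
	limbs[4] = load(m + 12) >> 8;	/* high bit omitted in this sketch */
}
```

Calling block_to_limbs() with either loader gives the same limbs for
the same input; the only thing that changes is how the five loads are
issued, which is the whole extent of the divergence I'm talking about.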