Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access

From: Jason A. Donenfeld
Date: Mon Nov 07 2016 - 13:09:43 EST

Next message: David Miller: "Re: [PATCH v6 0/7] add NS2 support to bgmac"
Previous message: Josh Triplett: "Re: [PATCH] rcu: update: make RCU_EXPEDITE_BOOT default"
In reply to: Eric Biggers: "Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access"
Next in thread: Jason A. Donenfeld: "Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Eric,

On Fri, Nov 4, 2016 at 6:37 PM, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> I agree, and the current code is wrong; but do note that this proposal is
> correct for poly1305_setrkey() but not for poly1305_setskey() and
> poly1305_blocks(). In the latter two cases, 4-byte alignment of the source
> buffer is *not* guaranteed. Although crypto_poly1305_update() will be called
> with a 4-byte aligned buffer due to the alignmask set on poly1305_alg, the
> algorithm operates on 16-byte blocks and therefore has to buffer partial blocks.
> If some number of bytes that is not 0 mod 4 is buffered, then the buffer will
> fall out of alignment on the next update call. Hence, get_unaligned_le32() is
> actually needed on all the loads, since the buffer will, in general, be of
> unknown alignment.

Hmm... The general data flow that strikes me as most pertinent is
something like:

struct sk_buff *skb = get_it_from_somewhere();
skb = skb_share_check(skb, GFP_ATOMIC);
num_frags = skb_cow_data(skb, ..., ...);
struct scatterlist sg[num_frags];
sg_init_table(sg, num_frags);
skb_to_sgvec(skb, sg, ..., ...);
blkcipher_walk_init(&walk, sg, sg, len);
blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
while (walk.nbytes >= BLOCK_SIZE) {
size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
}
if (walk.nbytes) {
poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
blkcipher_walk_done(&desc, &walk, 0);
}

Is your suggestion that that in the final if block, walk.src.virt.addr
might be unaligned? Like in the case of the last fragment being 67
bytes long?

If so, what a hassle. I hope the performance overhead isn't too
awful... I'll resubmit taking into account your suggestions.

By the way -- offlist benchmarks sent to me concluded that using the
unaligned load helpers like David suggested is just as fast as that
handrolled bit magic in the v1.

Regards,
Jason

Next message: David Miller: "Re: [PATCH v6 0/7] add NS2 support to bgmac"
Previous message: Josh Triplett: "Re: [PATCH] rcu: update: make RCU_EXPEDITE_BOOT default"
In reply to: Eric Biggers: "Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access"
Next in thread: Jason A. Donenfeld: "Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]