Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library
From: Jason A. Donenfeld
Date: Thu Aug 02 2018 - 22:48:53 EST
Hey Andy,
Thanks too for the feedback. Responses below:
On Wed, Aug 1, 2018 at 7:09 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> > I think the above changes would also naturally lead to a much saner
> > patch series where each algorithm is added by its own patch, rather than
> > one monster patch that adds many algorithms and 24000 lines of code.
> >
>
> Yes, please.
Ack, will be in v2.
> I like this a *lot*. (But why are you passing have_simd? Shouldn't
> that check live in chacha20_arch? If there's some init code needed,
> then chacha20_arch() should just return false before the init code
> runs. Once the arch does whatever feature detection it needs, it can
> make chacha20_arch() start returning true.)
The have_simd argument is there so that the cost of saving and
restoring the FPU state can be amortized across several calls to the
crypto functions. Here's a snippet from WireGuard's send.c:
void packet_encrypt_worker(struct work_struct *work)
{
	struct crypt_queue *queue = container_of(work, struct multicore_worker, work)->ptr;
	struct sk_buff *first, *skb, *next;
	bool have_simd = simd_get();

	while ((first = ptr_ring_consume_bh(&queue->ring)) != NULL) {
		enum packet_state state = PACKET_STATE_CRYPTED;

		skb_walk_null_queue_safe(first, skb, next) {
			if (likely(skb_encrypt(skb, PACKET_CB(first)->keypair, have_simd)))
				skb_reset(skb);
			else {
				state = PACKET_STATE_DEAD;
				break;
			}
		}
		queue_enqueue_per_peer(&PACKET_PEER(first)->tx_queue, first, state);
		have_simd = simd_relax(have_simd);
	}
	simd_put(have_simd);
}
simd_get() and simd_put() do the usual irq_fpu_usable()/kernel_fpu_begin()
dance and return/take a boolean accordingly. simd_relax(was_on) is:
static inline bool simd_relax(bool was_on)
{
#ifdef CONFIG_PREEMPT
	if (was_on && need_resched()) {
		simd_put(true);
		return simd_get();
	}
#endif
	return was_on;
}
With this, we get the FPU amortization most of the time, while still
doing the right thing for the preemption case (since kernel_fpu_begin
disables preemption). This is quite an important performance
optimization. I'd actually prefer the lazy FPU restoration proposal
discussed a few months ago, but it looks like that hasn't progressed,
hence the need for FPU call amortization:
https://lore.kernel.org/lkml/CALCETrU+2mBPDfkBz1i_GT1EOJau+mzj4yOK8N0UnT2pGjiUWQ@xxxxxxxxxxxxxx/
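To make the amortization concrete, here is a minimal userspace sketch
that only counts how often the FPU section would be opened. It is not
the kernel code: the counters, mock_need_resched, and the two
encrypt_*() helpers are made up for illustration, and the simd_*()
bodies are stand-ins that assume the FPU is always usable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; none of this is the kernel's implementation. */
static unsigned int fpu_begin_calls;	/* how often the mock FPU section was opened */
static unsigned int fpu_depth;		/* 1 while the mock FPU section is open */
static bool mock_need_resched;		/* hypothetical stand-in for need_resched() */

static bool simd_get(void)
{
	/* Real code would check irq_fpu_usable() before kernel_fpu_begin(). */
	assert(fpu_depth == 0);
	fpu_begin_calls++;
	fpu_depth = 1;
	return true;
}

static void simd_put(bool was_on)
{
	if (was_on) {
		assert(fpu_depth == 1);
		fpu_depth = 0;	/* stands in for kernel_fpu_end() */
	}
}

static bool simd_relax(bool was_on)
{
	if (was_on && mock_need_resched) {
		simd_put(true);
		mock_need_resched = false;	/* the reschedule would happen here */
		return simd_get();
	}
	return was_on;
}

/* Amortized: one FPU section spans all packets unless a resched is pending. */
static unsigned int encrypt_amortized(size_t packets, size_t resched_at)
{
	bool have_simd;
	size_t i;

	fpu_begin_calls = 0;
	have_simd = simd_get();
	for (i = 0; i < packets; i++) {
		/* skb_encrypt(..., have_simd) would run here. */
		if (i == resched_at)
			mock_need_resched = true;
		have_simd = simd_relax(have_simd);
	}
	simd_put(have_simd);
	return fpu_begin_calls;
}

/* Naive: every packet pays for its own begin/end pair. */
static unsigned int encrypt_naive(size_t packets)
{
	size_t i;

	fpu_begin_calls = 0;
	for (i = 0; i < packets; i++) {
		bool have_simd = simd_get();

		/* skb_encrypt(..., have_simd) would run here. */
		simd_put(have_simd);
	}
	return fpu_begin_calls;
}
```

In this toy model the naive loop opens the FPU section once per packet,
while the amortized loop opens it once, plus once more for each point
where a reschedule was pending.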
>
> As I see it, there are two truly new things in the zinc patchset:
> the direct (in the direct call sense) arch dispatch, and the fact that
> the functions can be called directly, without allocating contexts,
> using function pointers, etc.
>
> In fact, I had a previous patch set that added such an interface for SHA256.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=8c59a4dd8b7ba4f2e5a6461132bbd16c83ff7c1f
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=7e5fbc02972b03727b71bc71f84175c36cbf01f5
Seems like SHA256 will be a natural next candidate for Zinc, given the demand.
> > Your patch description is also missing any mention of crypto accelerator
> > hardware. Quite a bit of the complexity in the crypto API, such as
> > scatterlist support and asynchronous execution, exists because it
> > supports crypto accelerators. AFAICS your new APIs cannot support
> > crypto accelerators, as your APIs are synchronous and operate on virtual
> > addresses. I assume your justification is that "djb algorithms" like
> > ChaCha and Poly1305 don't need crypto accelerators as they are fast in
> > software. But you never explicitly stated this and discussed the
> > tradeoffs. Since this is basically the foundation for the design you've
> > chosen, it really needs to be addressed.
>
> I see this as an advantage, not a disadvantage. A very large majority
> of in-kernel crypto users (by number of call sites under a *very*
> brief survey, not by number of CPU cycles) just want to do some
> synchronous crypto on a buffer that is addressed by a regular pointer.
> Most of these users would be slowed down if they used any form of
> async crypto, since the CPU can complete the whole operation faster
> than it could plausibly initiate and complete anything asynchronous.
> And, right now, they suffer the full overhead of allocating a context
> (often with alloca!), looking up (or caching) some crypto API data
> structures, dispatching the operation, and cleaning up.
>
> So I think the right way to do it is to have directly callable
> functions like zinc uses and to have the fancy crypto API layer on top
> of them. So if you actually want async accelerated crypto with
> scatterlists or whatever, you can call into the fancy API, and the
> fancy API can dispatch to hardware or it can dispatch to the normal
> static API.
Yes, exactly this.
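As a rough illustration of that layering, here is a small userspace
sketch. Everything in it is hypothetical: the toy_* names are invented,
the "cipher" is a plain XOR rather than real crypto, and the request
struct is only a stand-in for what a fancier API layer might carry. It
shows a directly callable function with direct arch dispatch (the arch
path declines until its init has run), and a request-style wrapper that
falls through to the same static function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool toy_arch_ready;		/* set once (hypothetical) feature detection ran */
static unsigned int toy_arch_hits;	/* counts how often the arch path was taken */

static void toy_arch_init(void)
{
	toy_arch_ready = true;
}

/* Arch path: returns false until init has run, so the caller falls back. */
static bool toy_xor_arch(uint8_t *dst, const uint8_t *src, size_t len, uint8_t key)
{
	size_t i;

	if (!toy_arch_ready)
		return false;
	/* A real implementation would use SIMD here; this is the same XOR. */
	for (i = 0; i < len; i++)
		dst[i] = src[i] ^ key;
	toy_arch_hits++;
	return true;
}

static void toy_xor_generic(uint8_t *dst, const uint8_t *src, size_t len, uint8_t key)
{
	size_t i;

	for (i = 0; i < len; i++)
		dst[i] = src[i] ^ key;
}

/* Directly callable: no context allocation, no function pointers. */
static void toy_xor(uint8_t *dst, const uint8_t *src, size_t len, uint8_t key)
{
	assert(dst != NULL && src != NULL);
	if (!toy_xor_arch(dst, src, len, key))
		toy_xor_generic(dst, src, len, key);
}

/* Request-style layer on top; "done" may be NULL. */
struct toy_request {
	uint8_t *dst;
	const uint8_t *src;
	size_t len;
	uint8_t key;
	void (*done)(struct toy_request *req, int err);
};

static int toy_submit(struct toy_request *req)
{
	/* No accelerator in this sketch, so dispatch to the static API. */
	toy_xor(req->dst, req->src, req->len, req->key);
	if (req->done)
		req->done(req, 0);
	return 0;
}
```

Callers that just have a buffer and a pointer call toy_xor() directly;
only callers that want the request machinery pay for it.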
Regards,
Jason