RE: [PATCH 0/17] Add zinc using existing algorithm implementations

From: Pascal Van Leeuwen
Date: Mon Mar 25 2019 - 05:11:59 EST


As someone who has been working on crypto acceleration hardware for the better
part of the past 20 years, I feel compelled to respond to this, in defence of
the crypto API (which we're really happy with ...).

> And honestly, I'm 1000% with Jason on this. The crypto/ model is hard to use,
> inefficient, and completely pointless when you know what your cipher or
> hash algorithm is, and your CPU just does it well directly.
>
> > we even have support already for async accelerators that implement it,
>
> Afaik, none of the async accelerator code has ever been worth anything on
> real hardware and on any sane and real loads. The cost of going outside the
> CPU is *so* expensive that you'll always lose, unless the algorithm has been
> explicitly designed to be insanely hard on a regular CPU.
>
The days of designing them *specifically* to be hard on a CPU are over, but
nevertheless, due to required security properties, *good* crypto is usually
still hard work for a CPU.
Especially *asymmetric* crypto, which can easily take milliseconds per operation
even on a high-end CPU. You can do a lot of interrupt processing in a millisecond.
Now some symmetric algorithms (AES, SHA, GHASH) have actually made it *into*
the CPU, somewhat changing the landscape, but:
a) That's really only a small subset of the crypto being used out there.
There's much more to crypto than just AES, SHA and GHASH.
b) This only applies to high-end CPUs. The large majority of embedded CPUs do
not have such crypto instructions built in. This may change, but I don't
really expect that to happen soon.
c) You're still keeping a big, power-hungry CPU busy for a lot of cycles doing
some fairly trivial work.

> (Corollary: "insanely hard on a regular CPU" is also easy to do by making the
> CPU be weak and bad. Which is not a case we should optimize for).
>

Linux is used a lot in embedded systems, which usually DO have fairly
weak CPUs. I would argue that's exactly the case you SHOULD optimize for,
as that's where you can *really* still make a difference as a programmer.

> The whole "external accelerator" model is odd. It was wrong. It only makes
> sense if the accelerator does *everything* (ie it's the network card), and
> then you wouldn't use the wireguard thing on the CPU at all, you'd have all
> those things on the accelerator (ie a "network card that does WG").
>
> One of the (best or worst, depending on your hangups) arguments for
> external accelerators has been "but I trust the external hardware with the
> key, but not my own code", aka the TPM or Disney argument. I don't think
> that's at all relevant to the discussion either.
>
> The whole model of async accelerators is completely bogus. The only crypto
> or hash accelerator that is worth it are the ones integrated on the CPU cores,
> which have the direct access to caches.
>
NOT true. We wouldn't still be in business after 20 years if that were true.

> And if the accelerator is some tightly coupled thing that has direct access to
> your caches, and doesn't need interrupt overhead or address translation etc
> (at which point it can be worth using) then you might as well just consider it
> an odd version of the above. You'd want to poll for the result anyway,
> because not polling is too expensive.
>
It's HARD to get the interfacing right to take advantage of the acceleration.
And that's exactly why you MUST have an async interface for that: to cope with
the large latency of external acceleration, going through the memory subsystem
and external buses, since - as you already pointed out - you cannot access the
CPU cache directly (though you can be fully coherent with it).
So to cope with that latency, you will need to batch, queue and pipeline your
processing (see the sketch below). This is not unique to crypto acceleration;
the same principles apply to e.g. your GPU as well. Or any HW worth anything,
for that matter.
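
To make that concrete, here's a rough sketch of the submit/callback pattern the
existing skcipher interface already gives you (my own illustration, not code
from any patch set; the demo_* names are made up and error handling is
trimmed). The point is that the submit path returns -EINPROGRESS, so you can
keep dozens of requests in flight and the accelerator's queue stays full:

  #include <crypto/skcipher.h>
  #include <linux/scatterlist.h>

  static void demo_encrypt_done(struct crypto_async_request *base, int err)
  {
          struct skcipher_request *req = skcipher_request_cast(base);

          /* runs later, from the driver's interrupt/tasklet path: hand the
           * finished buffer back to the packet path and release the request */
          skcipher_request_free(req);
  }

  static int demo_submit(struct crypto_skcipher *tfm, struct scatterlist *src,
                         struct scatterlist *dst, unsigned int len, u8 *iv)
  {
          struct skcipher_request *req = skcipher_request_alloc(tfm, GFP_ATOMIC);
          int ret;

          if (!req)
                  return -ENOMEM;

          skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
                                        demo_encrypt_done, NULL);
          skcipher_request_set_crypt(req, src, dst, len, iv);

          ret = crypto_skcipher_encrypt(req);
          if (ret == -EINPROGRESS || ret == -EBUSY)
                  return 0;       /* queued to the HW, keep submitting more */

          /* completed synchronously (e.g. a CPU implementation) or failed */
          demo_encrypt_done(&req->base, ret);
          return ret;
  }

A synchronous library call can never give you that kind of pipelining.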

What that DOES mean is that external crypto accelerators are indeed useless for
doing the occasional crypto operation; it really only makes sense for streaming
and/or bulk use cases, such as network packet or disk encryption, and for
operations that really take significant time, such as asymmetric crypto.
For the occasional symmetric crypto operation, by all means, do that on the CPU
using a very thin API layer for efficiency and simplicity. This is where Zinc
makes a lot of sense - I'm not against Zinc at all. But DO realise that when
you go that route, you forfeit any chance of benefiting from acceleration.
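
For reference, that thin layer boils down to a single, direct, synchronous
library call, roughly like this (paraphrasing the Zinc patches from memory,
so take the exact header and signature with a grain of salt):

  #include <zinc/chacha20poly1305.h>

  /* one plain function call: no transform objects, no scatterlists,
   * no completion callback - but also no path to an accelerator */
  chacha20poly1305_encrypt(dst, src, src_len, ad, ad_len, nonce, key);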

Ironically, for doing what WireGuard is doing - bulk packet encryption - the
async crypto API makes a lot more sense than Zinc. In Jason's defense, as
far as I know, there is no Poly/ChaCha HW acceleration out there yet, but I
can assure you that that situation is going to change soon :-)
Still, I would really recommend running WireGuard on top of the crypto API.
How much performance would you really lose by doing that? If there's
anything wrong with the efficiency of the crypto API implementation of
ChaCha/Poly, then just fix that.
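
For what it's worth, the construction WireGuard needs is already reachable
through the crypto API via the rfc7539 template, so the switch is mostly
plumbing. A sketch (key, nonce, scatterlists and the pkt_done callback are
placeholders here):

  #include <crypto/aead.h>
  #include <linux/scatterlist.h>

  /* setup, once per key */
  struct crypto_aead *tfm = crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, 0);
  crypto_aead_setkey(tfm, key, 32);       /* 256-bit ChaCha20 key */
  crypto_aead_setauthsize(tfm, 16);       /* Poly1305 tag */

  /* per packet: same async submit/callback pattern as the sketch above;
   * any registered accelerator driver gets picked up transparently */
  struct aead_request *req = aead_request_alloc(tfm, GFP_ATOMIC);
  aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, pkt_done, pkt);
  aead_request_set_ad(req, ad_len);
  aead_request_set_crypt(req, src_sg, dst_sg, payload_len, nonce);
  crypto_aead_encrypt(req);               /* typically returns -EINPROGRESS */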

> Just a single interrupt would completely undo all the advantages you got
> from using specialized hardware - both power and performance.
>
We have done plenty of measurements, on both power and performance, to prove
you wrong there. Typically our HW needs *at least* a full order of magnitude
less power to do the actual work. The CPU load for handling the interrupts etc.
tends to be around 20%. So assuming the CPU goes to sleep for the other 80%
of the time, the combined solution would need about 1/3rd of the power of a
CPU-only solution. It's one of our biggest selling points.
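
Spelling that out with round numbers (taking the figures above at face value):
if a CPU-only implementation spends power P on the crypto, the offloaded case
costs roughly 0.2 x P on the CPU (interrupts, driver, queueing) plus at most
0.1 x P in the accelerator itself, so about 0.3 x P in total - one third.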

As for performance - in the embedded space you can normally expect the crypto
accelerator to be *much* faster than the CPU subsystem as well. So yes, big
multi-core multi-GHz Xeons can do AES-GCM extremely fast and we'd have a hard
time competing with that, but that's not the relevant use case here.

> And that kind of model would work just fine with zinc.

>
> So an accelerator ends up being useful in two cases:
>
> - it's entirely external and part of the network card, so that there's no extra
> data transfer overhead
>
There are inherent efficiency advantages to integrating the crypto there, but
it's also very inflexible, as you're stuck with a particular, usually very limited,
implementation. And it doesn't fly very well with the current trends of Software
Defined Networking and Network Function Virtualization.
As for data transfer overhead: as long as the SoC memory subsystem has been
designed to cope with this, it's really irrelevant, as you shouldn't notice it.

> - it's tightly coupled enough (either CPU instructions or some on-die cache
> coherent engine) that you can and will just use it synchronously anyway.
>
> In the first case, you wouldn't run wireguard on the CPU anyway - you have a
> network card that just implements the VPN.
>
The irony here is that such a network card would probably be an embedded system
running Linux on an embedded CPU, offloading the crypto operations to our
accelerator ... Most smart NICs are architected that way.

Pascal van Leeuwen,
Silicon IP Architect @ Inside Secure