Re: Updated scalable urandom patchkit

From: Raymond Jennings
Date: Tue Oct 13 2015 - 00:27:13 EST




On Mon, Oct 12, 2015 at 7:46 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
On Mon, Oct 12, 2015 at 04:30:59PM -0400, George Spelvin wrote:
> Segregating abusers solves both problems. If we do this then we don't
> need to drop the locks from the nonblocking pool, which solves the
> security problem.

Er, sort of. I still think my points were valid, but they're
about a particular optimization suggestion you had. By avoiding
the need for the optimization, the entire issue is mooted.

Sure, I'm not in love with anyone's particular optimization, whether
it's mine, yours, or Andi's. I'm just trying to solve the scalability
problem while also trying to keep the code maintainable and easy to
understand (and over the years we've actually made things worse, to
the extent that having a single mixing for the input and output pools
is starting to be more of a problem than a feature, since we're coding
in a bunch of exceptions when it's the output pool, etc.).

So if we can solve a problem by routing around it, that's fine in my
book.

You have to copy the state *anyway* because you don't want it overwritten
by the ChaCha output, so there's really no point storing the constants.
(Also, ChaCha has a simpler input block structure than Salsa20; the
constants are all adjacent.)
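
A minimal sketch of the point above, not code from any posted patch. It assumes the RFC 7539 block layout (constants in words 0..3, key in 4..11, a 32-bit counter in word 12, nonce in 13..15). The struct name `crng_state` is hypothetical. The persistent state holds only 12 words; the constants are written into the working copy that has to be made anyway.

```c
#include <stdint.h>
#include <string.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* One ChaCha quarter round on four state words. */
#define QR(a, b, c, d) do {                 \
        a += b; d ^= a; d = ROTL32(d, 16);  \
        c += d; b ^= c; b = ROTL32(b, 12);  \
        a += b; d ^= a; d = ROTL32(d, 8);   \
        c += d; b ^= c; b = ROTL32(b, 7);   \
} while (0)

/* "expand 32-byte k": in ChaCha these occupy the adjacent words 0..3. */
static const uint32_t chacha_constants[4] = {
        0x61707865, 0x3320646e, 0x79622d32, 0x6b206574
};

/* Hypothetical persistent state: 12 words, constants not stored. */
struct crng_state {
        uint32_t key[8];
        uint32_t counter;
        uint32_t nonce[3];
};

/* Produce one 16-word output block without modifying *st. */
static void chacha20_block(const struct crng_state *st, uint32_t out[16])
{
        uint32_t in[16], x[16];
        int i;

        /* Build the input block; the constants are filled in here. */
        memcpy(&in[0], chacha_constants, sizeof(chacha_constants));
        memcpy(&in[4], st->key, sizeof(st->key));
        in[12] = st->counter;
        memcpy(&in[13], st->nonce, sizeof(st->nonce));

        memcpy(x, in, sizeof(x));       /* working copy for the rounds */
        for (i = 0; i < 10; i++) {      /* 10 double rounds = 20 rounds */
                QR(x[0], x[4], x[8],  x[12]);
                QR(x[1], x[5], x[9],  x[13]);
                QR(x[2], x[6], x[10], x[14]);
                QR(x[3], x[7], x[11], x[15]);
                QR(x[0], x[5], x[10], x[15]);
                QR(x[1], x[6], x[11], x[12]);
                QR(x[2], x[7], x[8],  x[13]);
                QR(x[3], x[4], x[9],  x[14]);
        }
        for (i = 0; i < 16; i++)
                out[i] = x[i] + in[i];  /* feed-forward of the input block */
}
```

Advancing the counter and rekeying are left to the caller; a real CRNG would also wipe `in` and `x` before returning.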

We're really getting into low-level implementations here, and I think
it's best to worry about these sorts of things when we have a patch to
review.....

(Note: one problem with ChaCha specifically is that it needs 16x32 bits
of registers, and Arm32 doesn't quite have enough. We may want to provide
an arch CPRNG hook so people can plug in other algorithms with good
platform support, like x86 AES instructions.)

So while a ChaCha20-based CRNG should be faster than a SHA-1 based
CRNG, and I consider this a good thing, for me speed is **not** more
important than keeping the underlying code maintainable and simple.
This is one of the reasons why I looked at, and then discarded, the idea
of using x86 accelerated AES as the basis for a CRNG. Setting up AES so that
it can be used easily with or without hardware acceleration looks very
complicated to do in a cross-architectural way, and I don't want to
drag in all of the crypto layer for /dev/random.

The same variables can be used (with different parameters) to decide if
we want to get out of mitigation mode. The one thing to watch out for
is that "cat </dev/urandom >/dev/sdX" may have some huge pauses once
the buffer cache fills. We don't want to forgive after too small a
fixed interval.

At least initially, once we go into mitigation mode for a particular
process, it's probably safer to simply not exit it.
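
The "enter mitigation mode and never leave" policy could be sketched roughly as follows. Every name and threshold here (`urandom_ratelimit`, `ABUSE_WINDOW_TICKS`, `ABUSE_BYTES_LIMIT`) is hypothetical and chosen only for illustration; nothing in the thread fixes these values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tuning values, not taken from any patch. */
#define ABUSE_WINDOW_TICKS 1000u        /* length of one accounting window */
#define ABUSE_BYTES_LIMIT  (1u << 20)   /* bytes allowed per window */

struct urandom_ratelimit {
        uint64_t window_start;  /* tick at which the current window began */
        uint64_t bytes;         /* bytes drawn during the current window */
        bool     mitigated;     /* sticky: never cleared once set */
};

/*
 * Account for a read of nbytes at time 'now' (in ticks).
 * Returns true if the caller should be served in mitigation mode.
 */
static bool ratelimit_account(struct urandom_ratelimit *rl,
                              uint64_t now, uint64_t nbytes)
{
        if (rl->mitigated)
                return true;            /* no forgiveness, per the text above */

        if (now - rl->window_start >= ABUSE_WINDOW_TICKS) {
                rl->window_start = now; /* start a fresh window */
                rl->bytes = 0;
        }

        rl->bytes += nbytes;
        if (rl->bytes > ABUSE_BYTES_LIMIT)
                rl->mitigated = true;

        return rl->mitigated;
}
```

Keeping the flag sticky sidesteps the long-pause problem: a stalled "cat </dev/urandom >/dev/sdX" cannot look well behaved just because it went quiet for a while.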

Finally, we have the issue of where to attach this rate-limiting structure
and crypto context. My idea was to use the struct file. But now that
we have getrandom(2), it's harder. mm, task_struct, signal_struct, what?

I'm personally more inclined to keep it with the task struct, so that
different threads will use different crypto contexts, just from a
simplicity point of view, since we won't need to worry about locking.

Since many processes don't use /dev/urandom or getrandom(2) at all,
the first time they do, we'd allocate a structure and hang it off the
task_struct. When the process exits, we would explicitly memzero it
and then release the memory.
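
A userspace sketch of that lifecycle, with `struct task` standing in for task_struct and all names hypothetical. In the kernel the equivalent would presumably use kzalloc(), memzero_explicit() and kfree() instead of the libc calls shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-task crypto context. */
struct crng_ctx {
        uint32_t key[8];
        uint64_t bytes_drawn;
};

/* Stand-in for task_struct: the pointer stays NULL until first use. */
struct task {
        struct crng_ctx *crng;
};

/* Allocate the context lazily; returns NULL if allocation fails. */
static struct crng_ctx *task_get_crng(struct task *t)
{
        if (!t->crng)
                t->crng = calloc(1, sizeof(*t->crng));
        return t->crng;
}

/* Zero memory through a volatile pointer so the stores are not elided. */
static void wipe(void *p, size_t n)
{
        volatile unsigned char *v = p;

        while (n--)
                *v++ = 0;
}

/* On task exit: explicitly zero the key material, then free it. */
static void task_exit_crng(struct task *t)
{
        if (!t->crng)
                return;         /* task never asked for random bytes */
        wipe(t->crng, sizeof(*t->crng));
        free(t->crng);
        t->crng = NULL;
}
```

Because each task owns its context, neither function takes a lock, which is the simplicity argument made above.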

(Post-finally, do we want this feature to be configurable under
CONFIG_EMBEDDED? I know keeping the /dev/random code size small is
a specific design goal, and abuse mitigation is optional.)

Once we code it up and can see how many bytes this takes, we can have
this discussion. I'll note that ChaCha20 is much more compact than SHA1:

   text    data     bss     dec     hex filename
   4230       0       0    4230    1086 /build/ext4-64/lib/sha1.o
   1152     304       0    1456     5b0 /build/ext4-64/crypto/chacha20_generic.o

... and I've thought about this as being the first step towards
potentially replacing SHA1 with something ChaCha20 based, in light of
the SHAppening attack. Unfortunately, BLAKE2s is similar to ChaCha
only from a design perspective, not an implementation perspective.
Still, I suspect that just looking at the crypto primitives, even if we
need to include two independent copies of the ChaCha20 core crypto and
the Blake2s core crypto, it still should be about half the size of the
SHA-1 crypto primitive.

And from the non-plumbing side of things, Andi's patchset increases
the size of /dev/random by a bit over 6%, or 974 bytes from a starting
base of 15719 bytes. It ought to be possible to implement a ChaCha20
based CRNG (ignoring the crypto primitives) in less than 974 bytes of
x86_64 assembly. :-)

- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

This might be stupid, but could something asynchronous work? Perhaps have the entropy generators dump their entropy into a central pool via a cyclic buffer, and have a background kthread manage the per-CPU or per-process entropy pools?
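
To make the question concrete, here is a rough single-producer, single-consumer sketch of such a buffer. All names are hypothetical, the XOR is only a placeholder for real pool mixing, and memory barriers and locking are left out, so this is not safe for concurrent use as written.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 64u   /* power of two, so indices wrap with a mask */

struct entropy_ring {
        uint32_t     slots[RING_SIZE];
        unsigned int head;      /* next slot to write (producer side) */
        unsigned int tail;      /* next slot to read (consumer side) */
};

/* Producer side: called by an entropy source. Drops the sample when full. */
static bool ring_push(struct entropy_ring *r, uint32_t sample)
{
        if (r->head - r->tail == RING_SIZE)
                return false;   /* full: never block the producer */
        r->slots[r->head & (RING_SIZE - 1)] = sample;
        r->head++;
        return true;
}

/*
 * Consumer side: what the background thread would run. Folds every queued
 * sample into 'pool' (pool_words must be nonzero) and returns the number
 * of samples consumed.
 */
static size_t ring_drain(struct entropy_ring *r, uint32_t *pool,
                         size_t pool_words)
{
        size_t n = 0;

        while (r->tail != r->head) {
                pool[n % pool_words] ^= r->slots[r->tail & (RING_SIZE - 1)];
                r->tail++;
                n++;
        }
        return n;
}
```

The producer only ever writes `head` and the consumer only ever writes `tail`, which is what would let the interrupt path avoid taking the pool lock.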


