Re: Updated scalable urandom patchkit

From: Theodore Ts'o
Date: Mon Oct 12 2015 - 09:54:51 EST


On Mon, Oct 12, 2015 at 03:49:08AM -0400, George Spelvin wrote:
>
> Suppose CPU 0 has an offset of 0, and CPU 1 has an offset of 12.
> (I'll also use a count-up index even though I know it's actually
> count-down in the code.)
>
> CPU0 writes 5 words at 0..4, leaving add_ptr = 5.
> Then CPU 1 writes 5 words at 17..21, leaving add_ptr = 10.
> Then CPU 0 writes its next word at 10.
>
> Words 5..9 never got mixed at all.

Hmm, good point. It *shouldn't* matter if the hash is secure, but the
potentially uneven mixing would be a concern.
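
To make the failure mode concrete, here's a userspace model of the
pattern (purely illustrative; the names are made up, and the real code
counts down rather than up):

#include <stdatomic.h>

#define POOL_WORDS 128

static unsigned int pool[POOL_WORDS];
static atomic_uint add_ptr;		/* shared cursor */

/*
 * Each CPU reserves n slots by bumping the shared cursor, but writes
 * them displaced by its own offset, so a concurrent reservation can
 * leave words that nobody ever touches.
 */
static void mix_words(const unsigned int *in, int n, unsigned int cpu_offset)
{
	unsigned int base = atomic_fetch_add(&add_ptr, n);
	int i;

	for (i = 0; i < n; i++)
		pool[(base + cpu_offset + i) % POOL_WORDS] ^= in[i];
}

Replaying the example above: CPU 0 (offset 0) reserves base 0 and
writes words 0..4; CPU 1 (offset 12) reserves base 5 but writes words
17..21; CPU 0's next reservation starts at base 10.  Words 5..9 never
got mixed.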

> However, I *also* have to use atomic_add to write r->pool[i],
> which is indeed expensive. PoC diff attached, but I'm not sure if
> you'll like it.

Yeah, that was the part that worried me. Whether we use an atomic add
or an atomic xor, it's going to be very expensive. An atomic add at
least works on arm, but arm implements it with an explicit retry loop,
so it effectively turns into a more efficient spinlock --- except that
you're spinning per word of the mixback. Instead of bouncing cache
lines once per extract, at the granularity of the lock, we would be
bouncing cache lines for every single word of the mixback operation.
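
For scale, the atomic variant of the mixback would look something like
this (a userspace sketch with stdatomic, not the actual patch); on arm
each atomic_fetch_add compiles to a load-exclusive/store-exclusive
retry loop:

#include <stdatomic.h>

#define POOL_WORDS 128

static _Atomic unsigned int pool[POOL_WORDS];

/*
 * Every word of mixback is a full atomic read-modify-write, so every
 * word is a potential cache-line bounce between CPUs, versus one lock
 * acquisition per extract with the current scheme.
 */
static void mix_back_atomic(const unsigned int *in, int n, unsigned int pos)
{
	int i;

	for (i = 0; i < n; i++)
		atomic_fetch_add(&pool[(pos + i) % POOL_WORDS], in[i]);
}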

> You're forgetting: keyboard codes don't go into the input pool.
> They go into the per-CPU fast pool, which does some quite thorough
> mixing and delivers 128 bits with the entropy effectively distributed
> across it.

They used to go through the input pool, and when we added the per-CPU
fast pool, we didn't revisit the question of whether the input_rotate
was still needed. So yes, at this point it might be worthwhile to
remove the input_rotate part of the mixing scheme.
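
For context, this is roughly the step in question, with the taps and
the twist elided (a simplified paraphrase from memory, not the actual
code):

#include <stdint.h>

static inline uint32_t rol32(uint32_t w, unsigned int shift)
{
	return (w << shift) | (w >> ((32 - shift) & 31));
}

/*
 * Each input word is rotated by a counter that advances 7 bits per
 * word (14 on pool wraparound), smearing low-order entropy -- e.g.
 * keyboard scancodes -- across all 32 bits.  With input now pre-mixed
 * by the per-CPU fast pools, that smearing is arguably redundant.
 */
static void mix_with_rotate(uint32_t *pool, unsigned int wordmask,
			    const uint32_t *in, int nwords,
			    unsigned int *rotp, unsigned int *posp)
{
	unsigned int input_rotate = *rotp, i = *posp;

	while (nwords--) {
		uint32_t w = rol32(*in++, input_rotate);

		i = (i - 1) & wordmask;
		pool[i] ^= w;		/* tap XORs and twist elided */
		input_rotate = (input_rotate + (i ? 7 : 14)) & 31;
	}
	*rotp = input_rotate;
	*posp = i;
}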

> Segregating abusers so they don't cause *performance* problems
> for other threads is a reasonable optimization.
>
> But the quoted conversation was talking about a *security*
> problem that could be created by /dev/urandom abusers if some
> of your lock-relaxing ideas were followed.

Segregating abusers solves both problems. If we do this then we don't
need to drop the locks from the nonblocking pool, which solves the
security problem. And it also means that each process has its own
CRNG, which is way faster than /dev/urandom even if you don't have the
locking problem, and it provides per-process isolation, which means
that process A can't carry out a backtracking attack on process B.
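
The per-process state doesn't need to be big, either.  Something like
this (a hypothetical layout, not a patch):

#include <stdint.h>

/*
 * A private key plus a per-process nonce means one process's output
 * stream is computationally independent of another's: compromising
 * process A's stream tells you nothing about process B's.
 */
struct process_crng {
	uint32_t key[8];	/* 256-bit ChaCha20 key, reseeded periodically */
	uint64_t nonce;		/* unique per process */
	uint64_t stream_pos;	/* block counter within the stream */
};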

> > Most processes don't actually use that much randomness, and I'm not
> > that worried about in-kernel users of the nonblocking pool. Even with
> > the most exec-heavy workload, setting up a new exec image is
> > heavyweight enough that you're really not going to be contending on
> > the lock. I also have trouble with someone spending $$$$ on a system
> > with 1K cpu cores and wasting all of their CPU power with running
> > shell scripts that fork and exec a lot. :-)
>
> But I just re-read Andi's original trouble report, and although he mentions
> AT_RANDOM, it's an application reading from /dev/urandom a lot that's
> causing the problem, *not* execve. My memory may have gotten confused.
>
> I can imagine that one /dev/urandom abuser can monopolize the lock and
> cause problems for a shell script.

Well, that's my point. If we restrict the nonblocking pool to kernel
users, then a /dev/urandom abuser won't be able to cause problems for
the shell script. And even the most inefficient shell script which is
firing off hundreds of processes per second is unlikely to be able to
swamp the spin loop, *and* I would hope that someone who spent $$$ on
a super-expensive 1K cpu system would also have the wit to actually
recode the shell script in a more efficient compiled language, instead
of wasting all of that expensive hardware.

> Remember that making antibacktracking work is the entire problem we're
> struggling to solve. If discarding it is on the table, I can solve
> the scalability problems extremely easily. Even without giving it up
> completely, just put the add-back inside if(trylock()) and the problem
> goes away.

The trylock() scheme is a bit dangerous, since we would need to make
sure that *every* other caller who takes the lock actually changes the
contents of the pool. Otherwise, a subsequent call to extract entropy
from that CPU will return the same value. And even if that is true
today (or we could make it true), if it ever changed, the code would
break. So it's pretty fragile.
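
Concretely, the proposal amounts to something like this (sketch only,
using the driver's current helper names):

	/* Opportunistic anti-backtrack mixback: skip it on contention. */
	if (spin_trylock(&r->lock)) {
		__mix_pool_bytes(r, out, EXTRACT_SIZE);
		spin_unlock(&r->lock);
	}
	/*
	 * else: we lost the race, and we're relying on whoever holds
	 * the lock changing the pool contents before our next extract
	 * -- an invariant that nothing in the code actually enforces.
	 */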

> I'm also not sure where you get 72 bytes from. You need 32 bytes of
> key, 8 bytes of nonce, and 4 or 8 bytes of stream position to form the
> ChaCha input block.

The ChaCha20 state block is 64 bytes, and then I was assuming two
4-byte control variables, for the number of cranks and the time since
the last reseed. We could actually store those control variables in
the IV portion of the ChaCha state structure, which would keep it at
64 bytes --- or we could generate a new state structure each time and
not bother storing the 16 bytes of constants in the state structure,
which trades a tiny amount of time and stack space for memory. These
are all implementation details, though....
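
For instance, a 64-byte layout along those lines might look like this
(sketch only; where the control words land is my assumption):

#include <stdint.h>

/*
 * The ChaCha20 state is sixteen 32-bit words: 4 constants, 8 key
 * words, then the counter and nonce/IV words.  Stashing the two
 * control variables in the IV words keeps the whole per-CPU structure
 * at exactly 64 bytes.
 */
struct crng_state {
	uint32_t state[16];
};

#define CRNG_CRANK_COUNT(c)	((c)->state[14])  /* cranks since reseed */
#define CRNG_RESEED_TIME(c)	((c)->state[15])  /* time of last reseed */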

> But why bother with even that much? If you're willing to discard
> antibacktracking, just have *one* key for the kernel, reseeded more
> often, and a per-thread nonce and stream position.

Well, we're not completely discarding backtracking protection. We are
discarding backtracking protection between successive reads from a
single process, and even there we would be reseeding every five
minutes (and this could be tuned), so there is *some*
anti-backtracking protection.

I'm going to make the claim that if a particular process is reading
from /dev/urandom in a tight loop, you probably don't *need*
backtracking protection. You're probably doing something silly such as
using /dev/urandom to overwrite a disk, or some such.

On the flip side, the time when you might care about anti-backtracking
protection is, say, when you're generating a new session key for a new
connection. So perhaps one approach is to use some kind of ratelimiter
algorithm so that if you're using /dev/urandom "carefully" (say, no
more than ten times a second), we'll use the nonblocking pool. But
once a process exceeds that limit, it will switch over to the CRNG,
and then the only performance that abuser process hurts is its own
(because it would be even faster if it were running something like
arc4random in userspace).
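
As a sketch of that policy (the numbers and names are made up, and a
fixed one-second window stands in for a real token bucket):

#include <stdbool.h>
#include <stdint.h>

#define URANDOM_READS_PER_SEC	10

struct urandom_rate {
	uint64_t window_start;		/* seconds */
	unsigned int reads;		/* reads in the current window */
	bool use_private_crng;		/* sticky once tripped */
};

/* Returns true once the process should be diverted to its own CRNG. */
static bool urandom_use_crng(struct urandom_rate *r, uint64_t now)
{
	if (r->use_private_crng)
		return true;
	if (now != r->window_start) {
		r->window_start = now;
		r->reads = 0;
	}
	if (++r->reads > URANDOM_READS_PER_SEC)
		r->use_private_crng = true;	/* demote the abuser for good */
	return r->use_private_crng;
}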

> But without interfering with legitimate users of /dev/urandom at all,
> I'd be quite willing to say that as soon as you read more than 32 bytes
> (current request plus recent history) from /dev/urandom, you get a
> private ChaCha20 structure to read from.

Yeah, that's what I was thinking, although I was willing to be a bit
more generous about when we switch them over to using the ChaCha pool.
I think doing this automatically is going to be much better than
creating a new /dev/frandom, since it's going to be hard to get
applications to switch over. We're still several years away from
switching people over to the new getrandom(2) system call, after all.

- Ted