Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

From: Alexei Starovoitov
Date: Sat Jan 06 2018 - 16:17:40 EST


On Sat, Jan 06, 2018 at 08:22:13PM +0000, Alan Cox wrote:
> > "Value prediction consists of predicting entire 32- and 64-bit register values
> > based on previously-seen values"
>
> For their implementation yes
>
> >
> > > In other words there are at least two problems with Linus proposal
> > >
> > > 1. The ffff/0000 mask has to be generated and that has to involve
> > > speculative flows.
> >
> > to answer above and Thomas's
> > "For one particular architecture and that's not a solution for generic code."
> >
> > The following:
> > #define array_access(base, idx, max) ({ \
> > union { typeof(base[0]) _val; unsigned long _bit; } __u;\
> > unsigned long _i = (idx); \
> > unsigned long _m = (max); \
> > unsigned long _mask = ~(long)(_m - 1 - _i) >> 63; \
> > __u._val = base[_i & _mask]; \
> > __u._bit &= _mask; \
> > __u._val; })
> >
> > is generic and no speculative flows.
>
> In the value speculation case imagine it's been called 1000 times for
> process A which as a limit of say 16 so that file->priv->max is 16, and
> then is run for process B which is different.
>
> A value speculating processor waiting for file->priv->max which has been
> pushed out of cache by an attacker is at liberty to say 'I've no idea
> what max is but hey it was 16 last time so lets plug 16 in and keep going"
>
> So while the change in the mask computation is clever and (subject to
> compiler cleverness) safe against guesses of which path will be taken I
> don't think it's generically safe.
>
> Unfortunately a lot of things we index are of different sizes as seen by
> different tasks, or when passed different other index values so this does
> matter.
>
> > Even if 'mask' in 'index & mask' example is a stall the educated
> > guess will come from the prior value (according to the quoted paper)
>
> Which might be for a different set of variables when the table is say per
> process like file handles, or the value is different each call.
>
> If we have single array of fixed size then I suspect you are right but
> usually we don't.

Thanks. I see your point. Agree on the above.
The variant 1 exploit reads 2000 bytes a second using 64-bit address math.
Things like 'fd' are 32-bit, so the attack complexity is already an order
of magnitude higher (without any kernel changes).
If we add the array_access() above, the exploit complexity increases even more.
Moreover, the attacker would need to train fdt->max_fds on a known
good fdt with millions of files for 100s of iterations, only to do
one speculative access on another fdt with a small max_fds
(to exploit value speculation from the large max_fds),
while keeping the cache line for that speculative out-of-bounds access
on the small fdt empty and measuring cache load times on another cpu.
I frankly don't see such an attack being able to keep cache lines pristine
for that small-fdt speculation while doing hundreds of non-speculative
accesses on another fdt. Way too many moving pieces.
Even if it were practical, the speed is probably going to be in bytes per
second, so by the time it reads anything meaningful, the attack detection
techniques (that people are actively working on) will be able to catch it.
In the end, security cannot be absolute.
The current level of paranoia shouldn't force us to make hasty decisions.
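
For reference, the mask arithmetic in the quoted array_access() can be
checked in isolation. This is only a user-space sketch, not kernel code:
bounds_mask() and masked_load() are hypothetical names, the shift uses
the width of long instead of the hard-coded 63, and it assumes that idx
and max are both below 2^(bits-1) and that the compiler does an
arithmetic right shift on negative signed values (implementation-defined
in ISO C, true for gcc and clang).

```c
#include <limits.h>

/*
 * Hypothetical helper: returns all ones when idx < max and zero
 * otherwise, without a conditional branch in the source.
 * Assumptions: idx and max are below 2^(bits-1); the unsigned-to-long
 * conversion wraps and >> on a negative long is arithmetic.
 */
static inline unsigned long bounds_mask(unsigned long idx, unsigned long max)
{
	/* max - 1 - idx has its top bit clear exactly when idx < max */
	long diff = (long)(max - 1 - idx);

	return (unsigned long)(~diff >> (sizeof(long) * CHAR_BIT - 1));
}

/*
 * Hypothetical helper: an out-of-bounds idx is forced to slot 0 and the
 * loaded value is zeroed. base[0] is read in that case, so the array
 * must have at least one element.
 */
static inline unsigned long masked_load(const unsigned long *base,
					unsigned long idx, unsigned long max)
{
	unsigned long mask = bounds_mask(idx, max);

	return base[idx & mask] & mask;
}
```

With a 4-entry table, masked_load(t, 2, 4) returns t[2], while
masked_load(t, 4, 4) reads t[0] and returns 0. Note this only covers the
index; it says nothing about a speculated value of max itself, which is
Alan's point above.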

So how about we do an array_access() macro similar to the above by default,
with an extra CONFIG_ to convert it to lfence?
Why default to the AND approach instead of lfence?
Because the kernel should still be usable. If security
sacrifices performance so much, such security will be turned off.
Ex: kpti is supposed to add 5-30% overhead. If it means 10% on a production
workload and the datacenter capacity cannot grow 10% overnight, kpti will
be off.
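
A rough sketch of what such a switch could look like, to make the
proposal concrete. Nothing here is an existing kernel interface:
CONFIG_ARRAY_ACCESS_LFENCE and array_read() are made-up names, the
lfence branch is x86-only, and the mask branch carries the same
assumptions as the macro above (arithmetic right shift, idx and max
below 2^(bits-1)).

```c
#include <limits.h>

#ifdef CONFIG_ARRAY_ACCESS_LFENCE	/* hypothetical option name */
/*
 * Conservative variant (x86 only): ordinary bounds check, then lfence
 * so the load is not issued until the check has resolved.
 */
static inline unsigned long array_read(const unsigned long *base,
				       unsigned long idx, unsigned long max)
{
	if (idx >= max)
		return 0;
	__asm__ __volatile__("lfence" ::: "memory");
	return base[idx];
}
#else
/*
 * Default variant: AND the index and the result with a mask that is
 * all ones when idx < max and zero otherwise. No fence is needed.
 */
static inline unsigned long array_read(const unsigned long *base,
				       unsigned long idx, unsigned long max)
{
	long diff = (long)(max - 1 - idx);
	unsigned long mask =
		(unsigned long)(~diff >> (sizeof(long) * CHAR_BIT - 1));

	return base[idx & mask] & mask;
}
#endif
```

Both variants return the same values for the same arguments, so callers
would not need to change when the option is flipped; only the cost and
the speculation behaviour differ.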