Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
From: Alexei Starovoitov
Date: Sat Jan 06 2018 - 13:13:42 EST
On Sat, Jan 06, 2018 at 12:32:42PM +0000, Alan Cox wrote:
> On Fri, 5 Jan 2018 18:52:07 -0800
> Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> > On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > > From: Andi Kleen <ak@xxxxxxxxxxxxxxx>
> > >
> > > When access_ok fails we should always stop speculating.
> > > Add the required barriers to the x86 access_ok macro.
> >
> > Honestly, this seems completely bogus.
>
> Also for x86-64 if we are trusting that an AND with a constant won't get
> speculated into something else surely we can just and the address with ~(1
> << 63) before copying from/to user space ? The user will then just
> speculatively steal their own memory.
+1
Any type of straight line code can address variant 1.
Like changing:
array[index]
into
array[index & mask]
works even when 'mask' is a variable.
To proceed with the speculative load from the array, the cpu has to
speculatively load 'mask' from memory and speculatively do the '&' in the alu.
If the attacker cannot influence 'mask', its speculative value
will bound 'index & mask' to be within the array limits.
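A minimal sketch of the masking idea above, in plain userspace C rather
than kernel code. The names table, table_mask and table_read() are made up
for illustration, and it assumes the array size is a power of two so that
'size - 1' is a valid mask. The 'volatile' is also my assumption: it is
there only to keep the compiler from folding the mask away.

```c
#include <stddef.h>

/* Hypothetical table; size assumed to be a power of two. */
#define TABLE_SIZE 8

static const int table[TABLE_SIZE] = { 0, 10, 20, 30, 40, 50, 60, 70 };

/*
 * The mask is a variable, not an immediate. 'volatile' forces a real
 * load and a real '&' so the compiler cannot optimize the mask out
 * after the bounds check.
 */
static volatile size_t table_mask = TABLE_SIZE - 1;

/* Returns the table entry, or -1 if index is out of range. */
static int table_read(size_t index)
{
	/* Architectural bounds check; the cpu may mispredict this branch. */
	if (index >= TABLE_SIZE)
		return -1;

	/*
	 * The load address depends on 'index & table_mask', so even on a
	 * mispredicted path the access stays within table[].
	 */
	return table[index & table_mask];
}
```

For an in-range index the '&' is a no-op, so table_read(3) returns 30 as
before. For an out-of-range index the function still returns -1, and the
idea is that a speculative load past the check can only reach table[0..7]
as long as the attacker cannot influence table_mask.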
I think the "let's sprinkle lfence everywhere" approach is going to
cause serious performance degradation. Yet the people pushing for lfence
haven't presented any numbers.
Last time lfence was removed from the networking drivers (via the switch to
dma_rmb()) the packet-per-second metric jumped 10-30%. lfence forces all outstanding loads
to complete. If any prior load is waiting on L3 or memory,
lfence will cause a 100+ ns stall and overall kernel performance will tank.
If the kernel adopts this "lfence everywhere" approach it will be
the end of the kernel as we know it. All high performance operations
will move into user space. Networking and IO will be first.
Since it will take years to design new cpus and even longer
to upgrade all servers, the industry will have no choice
but to move as much logic as possible out of the kernel.
kpti already made crossing the user/kernel boundary slower, but
the kernel itself is still fast. If the kernel has lfence everywhere,
the kernel itself will be slow.
In that sense retpolining the kernel is not as horrible as it sounds,
since both user space and the kernel have to be retpolined.