Re: [PATCH] x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue

From: Linus Torvalds
Date: Tue Apr 28 2015 - 13:16:42 EST


On Tue, Apr 28, 2015 at 9:58 AM, Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> Well, AFAIK, NOPs do require resources for tracking in the machine. I
> was hoping that hw would be smarter and discard at decode time but there
> probably are reasons that it can't be done (...yet).

I suspect it might be related to things like getting performance
counters and instruction debug traps etc right. There are quite
possibly also simply constraints where the front end has to generate
*something* just to keep the back end happy.

The front end can generally not just totally remove things without any
tracking, since the front end doesn't know if things are speculative
etc. So you can't do instruction debug traps in the front end afaik.
Or rather, I'm sure you *could*, but in general I suspect the best way
to handle nops without making them *too* special is to bunch up
several to make them look like one big instruction, and then associate
that bunch with some minimal tracking uop that uses minimal resources
in the back end without losing sight of the original nop entirely, so
that you can still do checks at retirement time.

So I think the "you can do ~5 nops per cycle" is not unreasonable.
Even in the uop cache, the nops have to take some space, and have to
do things like update eip, so I don't think they'll ever be entirely
free; the best you can do is minimize their impact.

> $ taskset -c 3 ./t
> Running 60 times, 1000000 loops per run.
> nop_0x90 average: 0.390625
> nop_3_byte average: 0.390625
>
> and those exact numbers are actually reproducible pretty reliably.

Yeah. That looks somewhat reasonable. I think the 16h architecture
technically decodes just two instructions per cycle, but I wouldn't be
surprised if there's some simple nop special casing going on so that
it can decode three nops in one go when things line up right. So you
might get 0.33 cycles for the best case, but then 0.5 cycles when it
crosses a 16-byte boundary or something. So you might have some
pattern where it decodes 32 bytes worth of nops as 12/8/12 bytes
(3/2/3 instructions), i.e. 8 nops in 3 cycles, which would come out to
about 0.38 cycles per nop. Add some random overhead for the loop, and I
could see the 0.39 cycles.
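
To make that arithmetic concrete, here is a minimal sketch of the
calculation. It assumes each decode group takes exactly one cycle and
that the nops are 4 bytes each (which is what 12/8/12 bytes as 3/2/3
instructions implies). The function name and the group patterns are
made up for illustration, and none of this is measured data:

```python
from fractions import Fraction


def cycles_per_nop(group_sizes):
    """Average cycles per nop, assuming each decode group takes one cycle.

    group_sizes lists how many nops are decoded in each successive cycle.
    This is a hypothetical model of the guess above, not a description of
    what the hardware actually does.
    """
    if not group_sizes or any(n <= 0 for n in group_sizes):
        raise ValueError("need at least one group, all sizes positive")
    return Fraction(len(group_sizes), sum(group_sizes))


NOP_BYTES = 4  # assumption: implied by 12/8/12 bytes == 3/2/3 instructions

patterns = {
    "best case (3 per cycle)": [3],
    "boundary case (2 per cycle)": [2],
    "32 bytes as 3/2/3": [3, 2, 3],
}

for name, groups in patterns.items():
    total_bytes = sum(groups) * NOP_BYTES
    avg = cycles_per_nop(groups)
    print(f"{name}: {total_bytes} bytes, {float(avg):.3f} cycles per nop")
```

The 3/2/3 pattern gives 3/8 = 0.375 cycles per nop, so whatever loop
overhead there is only has to account for the remaining ~0.015 cycles
to land on the measured 0.390625.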

That was wild handwaving with no data to back it up, but I'm trying to
explain to myself why you could get some odd number like that. It
seems _possible_ at least.

Linus