Re: RFC: Petition Intel/AMD to add POPF_IF insn

From: Linus Torvalds
Date: Thu Aug 18 2016 - 22:29:45 EST


On Thu, Aug 18, 2016 at 6:26 AM, Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
>
> I didn't do CPL0 tests yet. Realized that cli/sti can be tested in userspace
> if we set iopl(3) first.

Yes, but it might not be the same. So the timings could be very
different from a cpl0 case.

Also:

> Surprisingly, STI is slower than CLI. A loop with 27 CLI's and one STI
> converges to about ~0.5 insn/cycle:

You really really should not test "sti" together with an immediately
following sti or cli.

The sti instruction has an architecturally defined magical
one-instruction window following it when interrupts stay disabled. I
could easily see that resulting in strange special cases - Intel
actually at some point documented that a sequence of 'sti'
instructions is not going to disable interrupts forever (there was a
question of what happens if you start out with interrupts disabled, go
to a 16-bit code segment that is all filled with "sti" instructions so
that the 16-bit EIP will wrap around and continually do an infinite
series of 'sti' - do interrupts ever get enabled?)

I think Intel clarified that when you have a sequence of 'sti'
instructions, interrupts will get enabled after the second one, but
the point is that this is all "special" from a front-end angle. So
putting multiple 'sti' instructions in a bunch may be testing the
magical special case more than it would test anything *real*.

So at a minimum, make the sequence be "sti; nop" if you do it in a
loop. It may not change anything, but at least that way you'll know it
doesn't just test the magical case.
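
To make the "magical one-instruction window" concrete, here is a toy
model in C. It is only a sketch: it assumes the rule is "sti inhibits
interrupts for one instruction boundary, but only when it flips IF from
0 to 1", which matches the clarification recalled above but is not
taken from any measurement. The names toy_cpu, toy_step and
toy_first_window are made up for this illustration, and nothing here
says anything about how a real front-end handles the case or how long
it takes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: one bit of state, no real hardware behaviour implied. */
enum insn { INSN_STI, INSN_CLI, INSN_NOP };

struct toy_cpu {
    bool if_flag;               /* models EFLAGS.IF */
};

/*
 * Execute one instruction and report whether a pending maskable
 * interrupt could be taken at the boundary right after it.
 */
static bool toy_step(struct toy_cpu *cpu, enum insn in)
{
    bool shadow = false;

    switch (in) {
    case INSN_STI:
        /* Assumed rule: the shadow only applies when IF goes 0 -> 1. */
        shadow = !cpu->if_flag;
        cpu->if_flag = true;
        break;
    case INSN_CLI:
        cpu->if_flag = false;
        break;
    case INSN_NOP:
        break;
    }
    return cpu->if_flag && !shadow;
}

/*
 * Start with interrupts disabled and return the index of the first
 * instruction after which an interrupt could be taken, or -1 if the
 * sequence never opens a window.
 */
static int toy_first_window(const enum insn *seq, size_t n)
{
    struct toy_cpu cpu = { .if_flag = false };

    assert(seq != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (toy_step(&cpu, seq[i]))
            return (int)i;
    return -1;
}
```

In this model a run of back-to-back 'sti' opens a window after the
second one, "sti; nop" opens it after the nop, and alternating
"sti; cli" never opens it at all. So a loop of nothing but sti and cli
keeps landing on the special case, which is the reason for putting a
nop after the sti.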

Realistically, it's better to instead test a *real* instruction
sequence, ie just compare something like

    pushf
    cli
    .. do a memory operation here or something half-way real ..
    pop
    sti

and

    pushf
    cli
    .. do the same half-way real memory op here ..
    popf

and see which one is faster in a loop.

That said, your numbers really aren't very convincing. If popf really
is just 10 cycles on modern Intel hardware, it's already fast enough
that I really don't think it matters. Especially with "sti" being ~4
cycles, and there being a question about branch overhead anyway. You
win some, you lose some, but on the whole it sounds like "leave it
alone" wins.

Now, I know for a fact that there have been other x86 uarchitectures
that sucked at "popf", but they may suck almost equally at "sti". So
this might well be worth testing out on something that isn't Skylake.

Modern Intel cores really are pretty good at even the slow operations.
Things used to be much much worse in the bad old P4 days. I'm very
impressed with the big Intel cores.

Linus