Re: RFC: Petition Intel/AMD to add POPF_IF insn

From: Christian KÃnig
Date: Wed Aug 17 2016 - 16:08:20 EST


Well first of all you actually CCed AMDs graphics developers. So Alex and I can't say much about CPU optimizations.

Additional to that a comment below.

Am 17.08.2016 um 19:20 schrieb Denys Vlasenko:
Last year, a proposal was floated to avoid costly POPF.
In particular, each spin_unlock_irqrestore() does it,
a rather common operation.

https://lkml.org/lkml/2015/4/21/290
[RFC PATCH] x86/asm/irq: Don't use POPF but STI

* Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> Another different approach would be to formally state that
> pv_irq_ops.save_fl() needs to return all the flags, which would
> make local_irq_save() safe to use in this circumstance, but that
> makes a hotpath longer for the sake of a single boot time check.

...which reminds me:

Why does native_restore_fl restore anything other than IF? A branch
and sti should be considerably faster than popf.


Ingo agreed:
====
Yes, this has come up in the past, something like the patch below?

Totally untested and not signed off yet: because we'd first have to
make sure (via irq flags debugging) that it's not used in reverse, to
re-disable interrupts:
local_irq_save(flags);
local_irq_enable();
...
local_irq_restore(flags); /* effective local_irq_disable() */
I don't think we have many (any?) such patterns left, but it has to be
checked first. If we have such cases then we'll have to use different
primitives there.
====

Linus replied:
=====
"popf" is fast for the "no changes to IF" case, and is a smaller
instruction anyway.
====


This basically shot down the proposal.

But in my measurements POPF is not fast even in the case where restored
flags are not changes at all:

mov $200*1000*1000, %eax
pushf
pop %rbx
.balign 64
loop: push %rbx
popf
dec %eax
jnz loop

# perf stat -r20 ./popf_1g
4,929,012,093 cycles # 3.412 GHz ( +- 0.02% )
835,721,371 instructions # 0.17 insn per cycle ( +- 0.02% )
1.446185359 seconds time elapsed ( +- 0.46% )

If I replace POPF with a pop into an unused register, I get this:

You are comparing apples and bananas here.

The microcode path in the CPUs will probably through away the read if you don't actually use the register value.

Regards,
Christian.


loop: push %rbx
pop %rcx
dec %eax
jnz loop

209,247,645 cycles # 3.209 GHz ( +- 0.11% )
801,898,016 instructions # 3.83 insn per cycle ( +- 0.00% )
0.066210725 seconds time elapsed ( +- 0.59% )


IOW, POPF takes at least 6 cycles.


Linus does have a point that a "test+branch+STI" may end up not a clear win because of the branch.

But the need to restore IF flag exists, it is necessary not only for Linux, but for any OS
running on x86: they all have some sort of spinlock.

The addition of a POPF instruction variant which looks only at IF bit and changes
only that bit in EFLAGS may be a good idea, for all OSes.

I propose that we ask Intel / AMD to do that.

Maybe by the "ignored prefix" trick which was used when LZCNT insn was introduced
as REPZ-prefixed BSR?
Currently, REPZ POPF (f3 9d) insn does execute. Redefine this opcode as
POPF_IF. Then the same kernel will work on old and new CPUs.

CC'ing some @intel and @amd emails...