Re: [GIT PULL] x86/build changes for v4.17

From: Ingo Molnar
Date: Thu Apr 05 2018 - 04:05:07 EST



* Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> On Wed, Apr 04, 2018 at 05:05:25PM -0700, Linus Torvalds wrote:
> > for some reason the test_bit() case looks like
> > this:
> >
> > #define test_bit(nr, addr) \
> > (__builtin_constant_p((nr)) \
> > ? constant_test_bit((nr), (addr)) \
> > : variable_test_bit((nr), (addr)))
> >
> > which is much more straightforward anyway. I'm not quite sure why we
> > did it that odd way anyway, but I bet it's just "hysterical raisins"
> > along with the test_bit() not needing inline asm at all for the
> > constant case.
>
> I always assumed BT was a more expensive instruction than AND with
> immediate.

According to:

http://www.agner.org/optimize/instruction_tables.pdf

The Skylake costs for the 'BT', 'AND' and 'TEST' variants are:

Instruction   Operands  uops fused  uops unfused  uops port        latency  throughput
BT            r,r/i     1           1             p06              1        0.5
BT            m,r       10          10            -                -        5
BT            m,i       2           2             p06 p23          -        0.5
BTR BTS BTC   r,r/i     1           1             p06              1        0.5
BTR BTS BTC   m,r       10          11            -                -        5
BTR BTS BTC   m,i       3           4             p06 p4 p23       -        1
AND OR XOR    r,r/i     1           1             p0156            1        0.25
AND OR XOR    r,m       1           2             p0156 p23        -        0.5
AND OR XOR    m,r/i     2           4             2p0156 2p237 p4  5        1
TEST          r,r/i     1           1             p0156            1        0.25
TEST          m,r/i     1           2             p0156 p23        1        0.5


So if I'm reading it right, the relevant comparison would be:

BT            m,i       2           2             p06 p23          -        0.5
AND OR XOR    m,r/i     2           4             2p0156 2p237 p4  5        1

?
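
For reference, here is a minimal userspace sketch of the two code paths behind the quoted test_bit() macro. It is not the kernel's implementation: every sketch_* name is hypothetical, the variable path is a portable C stand-in for the BT-based inline asm, and __builtin_constant_p() assumes GCC or Clang. The point is only to show which shape of code each branch gives the compiler: with a constant bit number the mask and word index fold at compile time, so the compiler can typically emit a TEST or AND with an immediate against memory.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's BITS_PER_LONG. */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Constant bit number: word index and mask are compile-time constants,
 * so no BT instruction is needed for this path.
 */
static inline bool sketch_constant_test_bit(unsigned long nr,
					    const unsigned long *addr)
{
	return (addr[nr / SKETCH_BITS_PER_LONG] &
		(1UL << (nr % SKETCH_BITS_PER_LONG))) != 0;
}

/*
 * Variable bit number: portable shift-and-mask stand-in. The kernel's
 * x86 version uses BT via inline asm here instead (assumption: this
 * sketch does not try to reproduce that asm).
 */
static inline bool sketch_variable_test_bit(unsigned long nr,
					    const unsigned long *addr)
{
	return (addr[nr / SKETCH_BITS_PER_LONG] >>
		(nr % SKETCH_BITS_PER_LONG)) & 1UL;
}

/* Same dispatch shape as the quoted macro. */
#define sketch_test_bit(nr, addr)			\
	(__builtin_constant_p((nr))			\
	 ? sketch_constant_test_bit((nr), (addr))	\
	 : sketch_variable_test_bit((nr), (addr)))

/* Exercises both branches of the dispatch on a two-word bitmap. */
static inline void sketch_self_check(void)
{
	unsigned long map[2] = { 0x5UL, 0x1UL };
	volatile unsigned long n = 2;	/* volatile: not a compile-time constant */

	assert(sketch_test_bit(0, map));			/* constant path */
	assert(!sketch_test_bit(1, map));			/* constant path */
	assert(sketch_test_bit(n, map));			/* variable path */
	assert(sketch_test_bit(SKETCH_BITS_PER_LONG, map));	/* second word */
}
```

Both paths return the same result for any in-range bit, so the choice between them is purely about instruction cost, which is what the table rows above are comparing.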

Thanks,

Ingo