Re: [PATCH 0/4] amd64_edac: misc fixes

From: Borislav Petkov
Date: Mon Jun 01 2009 - 10:53:43 EST


> Obviously not, since it's a relatively new opcode. However, it is
> supported by both Intel and AMD with the opcode F3 0F B8 /r.
>
> The "/r" is the real problem ... it means one can't just mimic it with
> hard-coding .byte directives without fixing the arguments (which means a
> performance hit.) Furthermore, the 0F B8 opcode is JMPE, which doesn't
> take the same arguments either.

How about we pin the src/dst into a register:

#define popcnt_spelled(x) \
({ \
	typeof(x) __ret; \
	__asm__(".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t" \
		".byte 0xb8\n\t.byte 0xc0\n\t" \
		: "=a" (__ret) \
		: "0" (x)); \
	__ret; \
})

which generates

  40055e:	48 8b 45 e8		mov    -0x18(%rbp),%rax
  400562:	f3 48 0f b8 c0		popcnt %rax,%rax
  400567:	48 89 45 f8		mov    %rax,-0x8(%rbp)

here.

For operand sizes below 64 bits, the operand gets zero-extended first so
that garbage in the high 32/48 bits of %rax doesn't corrupt the result.
We might even want to do the movzwq explicitly so that some compiler
doesn't decide to take the version with the "0f b6" opcode, which
zero-extends only into the 16-/32-bit register. This way, you can popcnt
even single bytes, although the popcnt instruction itself doesn't accept
single-byte operands.

  400572:	0f b7 45 f2		movzwl -0xe(%rbp),%eax
  400579:	f3 48 0f b8 c0		popcnt %rax,%rax
  40057e:	66 89 45 f6		mov    %ax,-0xa(%rbp)
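
To make the explicit zero-extension idea concrete, here is a minimal
sketch, not code from the patch set. The names popcnt64(), popcnt16(),
popcnt8(), popcount_ref() and popcnt_selfcheck() are made up for
illustration. It assumes GCC or clang on x86-64; the
__builtin_cpu_supports() check and the portable fallback are my additions
so that it also runs on machines without POPCNT. The narrow wrappers widen
through an unsigned 64-bit type, so the compiler has to emit a full
zero-extension before the hard-coded "popcnt %rax,%rax" runs.

```c
#include <assert.h>
#include <stdint.h>

/* Portable reference: clear the lowest set bit until none remain. */
static unsigned int popcount_ref(uint64_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * Hypothetical helper: 64-bit popcnt through the hard-coded encoding
 * f3 48 0f b8 c0 ("popcnt %rax,%rax"), with source and destination
 * both pinned to %rax. Falls back to the reference loop on other
 * architectures or when the CPU does not advertise POPCNT.
 */
static uint64_t popcnt64(uint64_t v)
{
#if defined(__x86_64__) && defined(__GNUC__)
	if (__builtin_cpu_supports("popcnt")) {
		uint64_t ret;

		__asm__(".byte 0xf3, 0x48, 0x0f, 0xb8, 0xc0"
			: "=a" (ret)
			: "0" (v)
			: "cc");
		return ret;
	}
#endif
	return popcount_ref(v);
}

/* Narrow operands: the conversion to uint64_t zero-extends explicitly. */
static unsigned int popcnt16(uint16_t v)
{
	return (unsigned int)popcnt64((uint64_t)v);
}

static unsigned int popcnt8(uint8_t v)
{
	return (unsigned int)popcnt64((uint64_t)v);
}

/* Small sanity check against the reference loop. */
void popcnt_selfcheck(void)
{
	assert(popcnt64(~0ULL) == 64);
	assert(popcnt16(0xffff) == 16);
	assert(popcnt8(0x81) == 2);
	/* A signed -1 must go through the unsigned type of its own width. */
	assert(popcnt16((uint16_t)(short)-1) == 16);
	assert(popcnt64(0x00ff00ff00ff00ffULL) ==
	       popcount_ref(0x00ff00ff00ff00ffULL));
}
```

Note that a signed value has to be cast to the unsigned type of its own
width first; converting a negative short straight to a 64-bit type would
sign-extend and count the copied sign bits as well.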


So, in addition to popcnt itself, we have two movs added. This is still
fewer than the 30+ ops (plus function call overhead) that hweight* gets
translated into. I'll redo my kernel build benchmarks tomorrow to get
some more recent numbers on the performance gain.
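
For a feel of what the software path costs, here is a sketch of a common
branch-free 64-bit bit count. It is not necessarily the exact code behind
hweight64, and the names hweight64_sketch() and hweight_selfcheck() are
hypothetical. Each line is a handful of shifts, masks and adds, which is
where an instruction count in the tens comes from once constants have to
be loaded into registers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical software bit count: sum bits in parallel within 2-bit,
 * 4-bit and 8-bit fields, then add up the eight byte counts with one
 * multiply and take the top byte.
 */
static unsigned int hweight64_sketch(uint64_t w)
{
	w -= (w >> 1) & 0x5555555555555555ULL;
	w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
	w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	return (unsigned int)((w * 0x0101010101010101ULL) >> 56);
}

/* Compare against a plain one-bit-at-a-time loop for a few values. */
void hweight_selfcheck(void)
{
	static const uint64_t samples[] = {
		0, 1, 0xff, 0x8000000000000001ULL,
		0x0123456789abcdefULL, ~0ULL,
	};
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		uint64_t v = samples[i];
		unsigned int n = 0;

		for (; v; v >>= 1)
			n += (unsigned int)(v & 1);
		assert(hweight64_sketch(samples[i]) == n);
	}
}
```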

--
Regards/Gruss,
Boris.