Re: [PATCH v3 2/3] lib/crypto: x86/sha1: PHE Extensions optimized SHA1 transform function
From: AlanSong-oc
Date: Thu Mar 05 2026 - 01:20:32 EST
On 1/18/2026 8:31 AM, Eric Biggers wrote:
>
> On Fri, Jan 16, 2026 at 03:15:12PM +0800, AlanSong-oc wrote:
>> Zhaoxin CPUs implement SHA (Secure Hash Algorithm) in hardware via the
>> PHE (Padlock Hash Engine) extensions, which provide the XSHA1, XSHA256,
>> XSHA384 and XSHA512 instructions.
>>
>> With SHA implemented in hardware rather than software, applications can
>> achieve higher performance, greater security and more flexibility.
>>
>> This patch includes the XSHA1 instruction optimized implementation of
>> SHA-1 transform function.
>>
>> Signed-off-by: AlanSong-oc <AlanSong-oc@xxxxxxxxxxx>
>
> Please include the information I've asked for (benchmark results, test
> results, and link to the specification) directly in the commit message.
Sorry for missing the link to the specification in the commit message.
I will include the benchmark results, test results, and the link to the
specification directly in the commit message in the next version of the
patch, rather than in the cover letter.
>> +#if IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN)
>> +#define PHE_ALIGNMENT 16
>> +static void sha1_blocks_phe(struct sha1_block_state *state,
>> + const u8 *data, size_t nblocks)
>
> The IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN) should go in the CPU feature
> check, so that the code will be parsed regardless of the setting. That
> reduces the chance that future changes will cause compilation errors.
I will address this problem using the approach you described below in
the next version of the patch.
>
>> + /*
>> + * XSHA1 requires %edi to point to a 32-byte, 16-byte-aligned
>> + * buffer on Zhaoxin processors.
>> + */
>
> This seems implausible. In 64-bit mode a pointer can't fit in %edi. I
> thought you mentioned that this instruction is 64-bit compatible? You
> may have meant %rdi.
>
> Interestingly, the spec you provided specifically says the registers
> operated on are %eax, %ecx, %esi, and %edi.
>
> So assuming the code works, perhaps both the spec and your code comment
> are incorrect?
>
> These errors don't really inspire confidence in this instruction.
Sorry for the misleading comment; I will correct it in the next version
of the patch. The specification provided earlier uses the 32-bit
registers only as an example, which doesn't mean the instruction
supports only 32-bit mode. The updated specification clarifies this
point explicitly and is available at the following link.
(https://gitee.com/openzhaoxin/zhaoxin_specifications/blob/20260227/ZX_Padlock_Reference.pdf)
>
>> + memcpy(dst, state, SHA1_DIGEST_SIZE);
>> + asm volatile(".byte 0xf3,0x0f,0xa6,0xc8"
>> + : "+S"(data), "+D"(dst)
>> + : "a"((long)-1), "c"(nblocks));
>> + memcpy(state, dst, SHA1_DIGEST_SIZE);
>
> Is the reason for using '.byte' that the GNU and clang assemblers don't
> implement the mnemonic for this Zhaoxin-specific instruction? The spec
> implies that the intended mnemonic is "rep sha1".
>
> If that's correct, could you add a comment like /* rep sha1 */ so that
> it's clear what the intended instruction is?
The '.byte' directive is used to emit the correct machine code even when
the assembler lacks a mnemonic for this instruction. I will add an
appropriate comment in the next version of the patch to make the
intended instruction clear.
> Also, the spec describes all four registers as both input and output
> registers. Yet your inline asm marks %rax and %rcx as inputs only.
Thank you for pointing this out.
On the one hand, when the '+' constraint modifier is applied to an
operand, that operand is treated as both an input and an output, so
%rsi and %rdi already count as input operands.
On the other hand, the instruction modifies the values in %rax, %rsi,
and %rcx after it executes, so these registers should carry the '+'
constraint modifier to tell the compiler that the assembly updates
them; per the GCC manual, clobbers cannot be used to indicate that
input operands are modified. However, since %rax is initialized with a
constant value, it does not need the '+' modifier and can simply be
specified as an input operand.
In addition, although %rdi itself is not modified by the instruction,
the memory it references is updated, so a "memory" clobber should be
added to notify the compiler of the possible memory side effects.
The corrected inline assembly should be written as follows:
asm volatile(".byte 0xf3,0x0f,0xa6,0xc8" /* REP XSHA1 */
             : "+S"(data), "+c"(nblocks)
             : "a"((long)-1), "D"(dst)
             : "memory");
I will adopt it in the next version of the patch.
>> @@ -59,6 +79,11 @@ static void sha1_mod_init_arch(void)
>> {
>> if (boot_cpu_has(X86_FEATURE_SHA_NI)) {
>> static_call_update(sha1_blocks_x86, sha1_blocks_ni);
>> +#if IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN)
>> + } else if (boot_cpu_has(X86_FEATURE_PHE_EN)) {
>> + if (boot_cpu_data.x86 >= 0x07)
>> + static_call_update(sha1_blocks_x86, sha1_blocks_phe);
>> +#endif
>
> I think it should be:
>
> } else if (IS_ENABLED(CONFIG_CPU_SUP_ZHAOXIN) &&
> boot_cpu_has(X86_FEATURE_PHE_EN) &&
> boot_cpu_data.x86 >= 0x07) {
> static_call_update(sha1_blocks_x86, sha1_blocks_phe);
>
> ... so (a) the code will be parsed even when !CONFIG_CPU_SUP_ZHAOXIN,
> and (b) functions won't be unnecessarily disabled when
> boot_cpu_has(X86_FEATURE_PHE_EN) && boot_cpu_data.x86 < 0x07).
Thanks for the suggestion; that makes more sense.
Guarding the code with #if/#endif is a poor choice, as it prevents the
compiler from parsing the code when the option is disabled. Folding the
IS_ENABLED() and CPU family checks directly into the condition is
cleaner.
>
> As before, all these comments apply to the SHA-256 patch too.
I will also apply all of the suggestions above to the SHA-256 patch.
Please accept my apologies for the delayed response due to
administrative procedures and the recent holidays. Thank you for your
review and valuable suggestions.
Best Regards
AlanSong-oc