With rep_good set, memcpy jumps to memcpy_c, so the code in this patch is not run there;
we may do further optimization on that path later.
Yes, but in fact memcpy_c does not perform better on some microarchitectures (such as
Wolfdale-3M), especially in the unaligned cases, so we need to optimize for it, and I think
the first step is to optimize the original memcpy() code.
As mentioned above, we will optimize memcpy_c further soon.
Two reasons:
1. The movs instruction has a long startup latency.
2. The movs instruction does not handle the unaligned case well.
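To illustrate the second reason, here is a minimal userspace sketch in C of a loop-based copy that avoids movs and handles misalignment explicitly. It is not the code from the patch; the name copy_words and the 8-byte word size are assumptions made for the example. It copies bytes until the destination is aligned, then moves 8 bytes per iteration, then finishes the tail.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, for illustration only; not taken from the patch. */
static void *copy_words(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: byte copies until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 7) != 0) {
        *d++ = *s++;
        n--;
    }

    /*
     * Body: 8 bytes per iteration. The source may still be unaligned,
     * so the word is moved with fixed-size memcpy calls, which a
     * compiler can typically lower to a single load and store.
     */
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, s, sizeof(w));
        memcpy(d, &w, sizeof(w));
        d += 8;
        s += 8;
        n -= 8;
    }

    /* Tail: remaining bytes, fewer than 8. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

A loop like this has no fixed startup cost, which is why it can beat movs on short or unaligned copies; whether it does on a given CPU still has to be measured.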
BTW, the improvement comes only from the Core2 shift-register optimization,
but on most earlier CPUs shift-register code is very sensitive because of the decode stage.
I have tested Atom, Opteron, and Nocona; the new patch is still better.
I think we can add a flag so that this improvement is enabled only for Core2 and similar CPUs,
just like X86_FEATURE_REP_GOOD.
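A rough sketch of what such flag-based selection could look like, written as standalone C rather than kernel code. Everything here is hypothetical: FEAT_REP_GOOD only stands in for X86_FEATURE_REP_GOOD, FEAT_SHIFT_COPY is an invented name for the proposed flag, and the three copy routines are identical placeholders. The kernel itself does not select memcpy through a function pointer like this, so treat it only as a picture of the decision order.

```c
#include <stddef.h>

/* Hypothetical feature bits, for illustration only. */
enum {
    FEAT_REP_GOOD   = 1u << 0, /* stands in for X86_FEATURE_REP_GOOD */
    FEAT_SHIFT_COPY = 1u << 1  /* invented: Core2-like CPUs */
};

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Shared placeholder body; real variants would differ in strategy. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Placeholder for the movs-based path (memcpy_c in the thread). */
static void *copy_rep_string(void *dst, const void *src, size_t n)
{
    return copy_bytes(dst, src, n);
}

/* Placeholder for the new code from the patch. */
static void *copy_shift(void *dst, const void *src, size_t n)
{
    return copy_bytes(dst, src, n);
}

/* Placeholder for the original memcpy() loop. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
    return copy_bytes(dst, src, n);
}

/* Pick a variant once, based on the CPU's feature bits. */
static copy_fn select_copy(unsigned int features)
{
    if (features & FEAT_REP_GOOD)
        return copy_rep_string; /* rep_good wins, as it does today */
    if (features & FEAT_SHIFT_COPY)
        return copy_shift;      /* only on CPUs that set the new flag */
    return copy_generic;        /* everything else keeps the old code */
}
```

With this ordering a CPU that sets both bits still takes the movs path, which matches the current behavior described at the top of the thread.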
I think we should optimize for Core2 in the memcpy_c function in the future.