"Richard B. Johnson" wrote:
>
> With intel processors, the 'rep' before an instruction will not
> execute that instruction if ecx is already zero. You do not
> have to test. Also, a jump is often much more harmful in instruction
> time than straight-through instruction. For instance, the fastest
> 486 code for an unaligned copy is:
>
> movl SRC(%esp), %esi
> movl DST(%esp), %edi
> movl CNT(%esp), %ecx
> shrl $1,%ecx
> rep movsw
> adcl %ecx,%ecx
> rep movsb

Agreed. But most of the time we memset or memcpy memory regions that
are aligned at compile time or by kmalloc. In both cases the alignment
is 4 or a higher power of 2, which makes such code redundant.

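
For anyone reading along, here is a rough C model of what the quoted
shrl / rep movsw / adcl / rep movsb sequence does. It is only a sketch
of the semantics, not a replacement for the assembly, and the function
name copy_words_then_byte is made up for this illustration. The point
is that the low bit of the count falls into the carry flag, survives
the word copy, and becomes the count for the trailing byte copy, so no
compare or branch is needed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: models shrl/rep movsw/adcl/rep movsb in portable C. */
static void copy_words_then_byte(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t words = n >> 1;  /* shrl $1,%ecx: ecx = n / 2 ...              */
    size_t carry = n & 1;   /* ... and the bit shifted out lands in CF    */

    /* rep movsw: copies `words` 16-bit units, and does nothing at all
       when the count is zero, so no test is needed before it. */
    for (size_t i = 0; i < words; i++) {
        memcpy(d, s, 2);
        d += 2;
        s += 2;
    }

    /* adcl %ecx,%ecx: ecx is 0 after the rep, so ecx = 0 + 0 + CF.
       rep movsb then copies that many (0 or 1) trailing bytes. */
    for (size_t i = 0; i < carry; i++)
        *d++ = *s++;
}
```

Calling copy_words_then_byte(dst, src, 7) would copy three words and
then one byte; with a count of 0 neither loop runs.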
> If it's longword aligned, i.e., both source and destination addresses
> are clear in their low two bits, moving longwords through the edx
> register, with eax and ebx being the index registers, is faster, even with
> a beginning test for longword size.
>
> movl SRC(%esp), %eax
> movl DST(%esp), %ebx
> movl CNT(%esp), %ecx
> testl $3, %ecx
> jz 2f
> shrl $2, %ecx # long words CY set if an extra word
> 1: movl (%eax), %edx # Do NOT touch EAX in the next instruction
> movl %edx, (%ebx) # Do NOT touch EBX in the next instruction
> leal 4(%eax), %eax # Adjust EAX index now
> leal 4(%ebx), %ebx # Adjust EBX index now
> decl %ecx # does not change CY
> jnz 1b
>
> 2:
>
> To be able to run some instructions in parallel, you have to follow the
> idea shown in the above comments, i.e., don't touch an index register
> in the instructions immediately following its use to address memory.
>
> This will allow the memory access to occur during the parallel execution
> of the next instruction(s).

I made exactly that mistake in memcpy - I added 4 to the register that
the previous instruction had just used for a memory reference.
I'm not so sure about placing "decl" between the two "leal"s. I am using
"addl", which is supposed to go through the V pipe (at least on the 586),
just as "decl" can.
Anyway, I'll run some performance tests on an old 486 I have.

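
To make the aligned case concrete, here is a minimal C sketch of the
idea of testing for longword alignment up front and then moving one
longword per iteration through a temporary. It is an illustration under
my own assumptions, not the quoted routine: the name
copy_longwords_if_aligned is hypothetical, and I take "longword
aligned" to mean that both pointers and the count are multiples of 4.
Instruction scheduling (keeping the index update away from the memory
access) is up to the compiler here and is not modeled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper. Returns how many bytes went through the
   longword path (n if everything was aligned, otherwise 0). */
static size_t copy_longwords_if_aligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t copied_as_longs = 0;

    /* One test covers both addresses and the count: all three must
       have their low two bits clear. */
    if ((((uintptr_t)d | (uintptr_t)s | (uintptr_t)n) & 3) == 0) {
        for (size_t longs = n >> 2; longs != 0; longs--) {
            uint32_t tmp;        /* plays the role of %edx */
            memcpy(&tmp, s, 4);  /* load through the source index */
            memcpy(d, &tmp, 4);  /* store through the destination index */
            s += 4;              /* advance both indexes, as the leal pair does */
            d += 4;
        }
        copied_as_longs = n;
        n = 0;
    }

    /* Fallback for anything that failed the alignment test. */
    for (; n != 0; n--)
        *d++ = *s++;

    return copied_as_longs;
}
```

With two 4-byte-aligned buffers and a count of 16 this takes the
longword path; with a count of 6 it falls through to the byte loop.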
> The decl %ecx should be put BETWEEN the two `leal` instructions so that
> the address calculation can occur in parallel with the register operation.
> LEA does not affect the flags. In the example above I didn't do this
> because it makes the code unclear.
>
> Various registers used as index registers are not all the same. Register
> EAX was not an index register in i386 machines. It became one in i486
> machines. It is faster to use (%eax) than (%ebx).
Right. This is inherited from earlier '86 CPUs where "ax" was the
accumulator - that's why many arithmetic operations generate smaller
code when the target is ax/eax.
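
As a small illustration of the code-size point, the two byte arrays
below hold the usual 32-bit encodings of adding an immediate to eax and
to ebx. The array and function names are made up for this example, and
the immediate 0x12345678 is arbitrary. The accumulator has its own
one-byte opcode (05), while any other register needs the general opcode
(81) plus a ModRM byte.

```c
#include <stddef.h>

/* addl $0x12345678, %eax : accumulator short form, opcode 05 + imm32 */
static const unsigned char add_imm32_eax[] = {
    0x05, 0x78, 0x56, 0x34, 0x12
};

/* addl $0x12345678, %ebx : general form, opcode 81 /0, ModRM C3, imm32 */
static const unsigned char add_imm32_ebx[] = {
    0x81, 0xC3, 0x78, 0x56, 0x34, 0x12
};

/* Hypothetical helper: bytes saved by targeting the accumulator. */
static size_t accumulator_form_saving(void)
{
    return sizeof add_imm32_ebx - sizeof add_imm32_eax;
}
```

Here the eax form is one byte shorter. This only shows the size
difference, it says nothing about the timing of (%eax) versus (%ebx)
addressing.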
best,
Petkan
This archive was generated by hypermail 2b29 : Thu Aug 31 2000 - 21:00:27 EST