Yes, adcl has to go in the U-pipe. In theory it could go faster by
using another register to make everything pair nicely:
	movl (%%esi), %%ebx
	movl 4(%%esi), %%ecx
	adcl %%ebx, %%eax
	movl %%ebx, (%%edi)
	adcl %%ecx, %%eax
	movl %%ecx, 4(%%edi)
But theory doesn't mean diddly here. Do it and time it. Ideally,
%ebp would be available, but that assumes -fomit-frame-pointer, and I
may have hit some other problems when I tried it.
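If you do time it, it helps to have a plain C reference to check the
results against. Here is a minimal sketch of what the adcl chain
computes: a 32-bit ones' complement sum with end-around carry,
accumulated while copying. The name csum_copy_words() and its signature
are made up for illustration and are not the kernel's; it handles whole
32-bit words only and makes no attempt at speed.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reference helper (not the kernel routine): copy nwords
 * 32-bit words from src to dst and return the running 32-bit ones'
 * complement sum, starting from sum.
 */
uint32_t csum_copy_words(const uint32_t *src, uint32_t *dst,
                         size_t nwords, uint32_t sum)
{
    uint64_t acc = sum;   /* wide accumulator keeps the carries */
    size_t i;

    for (i = 0; i < nwords; i++) {
        uint32_t w = src[i];   /* the load:  movl (%esi), %ebx  */
        acc += w;              /* the add:   adcl %ebx, %eax    */
        dst[i] = w;            /* the store: movl %ebx, (%edi)  */
    }

    /* Fold the saved carries back in (end-around carry). */
    while (acc >> 32)
        acc = (acc & 0xffffffffu) + (acc >> 32);

    return (uint32_t)acc;
}
```

Run both versions over the same buffers and compare the returned sums
and the destination contents before trusting any timing numbers.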
Some history: csum_partial_copy() didn't use to exist. In the
beginning (well, of 1.3 anyway) there was csum_partial_copy_fromuser(),
which used an inner loop similar to the above. But the segment
overrides botched the pipelining beyond redemption (six instructions,
four of which must go in the U-pipe, leaving two to overlap and two
empty V-pipe positions), so I tweaked it to its present form, which
didn't require push/pop of %ecx. Then, csum_partial_copy() was
(apparently) derived from csum_partial_copy_fromuser() by the obvious
simplification of omitting fs: without regard to Pentium pipelining.
So go for it!
Tom.