Have you actually read the timings? On the 486 and up, it's faster to
do:
    movb %al, (%edi)
    incl %edi
than it is to do
    stosb
And you get to pick which index register to use when you use the
separate instructions, which is much better for register scheduling.
The only time you _might_ win is if you use the repeated lods/stos/movs
instructions for a large fill/copy. If you are using individual lods or
stos instructions as part of a bigger loop, you're slowing down your code.
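
To make the comparison concrete, here's a rough sketch of a byte fill
written both ways, in AT&T syntax. It's a fragment, not a complete
routine. The register assignments (%edi = destination, %ecx = nonzero
byte count, %al = fill byte) and the label name are assumptions for
illustration only.

```
# Version 1: separate instructions. Any free register could stand in
# for %edi here; the string instruction in version 2 doesn't give you
# that choice.
fill_loop:
        movb    %al, (%edi)     # store the fill byte
        incl    %edi            # advance the destination pointer
        decl    %ecx            # one fewer byte to go
        jnz     fill_loop       # loop until the count reaches zero

# Version 2: repeated string instruction. Hard-wired to %edi, %ecx
# and %al; this is the form that _might_ win on a large fill.
        cld                     # clear direction flag so %edi counts up
        rep stosb               # store %al to (%edi), %ecx times
```

The thing to avoid is the hybrid: a hand-written loop like version 1
with a lone stosb in place of the movb/incl pair.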