No, the fastest and shortest way to do it is:
	.. dest in %edi ..
	movl $1024,%ecx
	xorl %eax,%eax
	rep; stosl
Which is in fact exactly how Linux does it.
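For anyone who wants to try the same sequence from C, here is a minimal sketch using GCC-style inline asm. The name clear_words() is made up for illustration and this is not the kernel's actual source. It assumes the direction flag is clear (which the x86 ABIs guarantee on function entry) and falls back to a plain loop on anything that isn't x86 or GCC-compatible.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not kernel code): zero `count` 32-bit words at `dest`. */
static void clear_words(uint32_t *dest, size_t count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /*
     * Same register setup as the snippet above:
     *   "D" -> destination in %edi (%rdi on x86-64)
     *   "c" -> word count in %ecx (%rcx on x86-64)
     *   "a" -> zero in %eax
     * Both pointer and count are modified by the instruction, hence "+".
     * Assumes the direction flag is clear, so the store walks upward.
     */
    __asm__ volatile("rep stosl"
                     : "+D"(dest), "+c"(count)
                     : "a"(0)
                     : "memory");
#else
    /* Portable fallback for other targets or compilers. */
    while (count--)
        *dest++ = 0;
#endif
}
```

Calling clear_words(page, 1024) clears 1024 words of 4 bytes each, i.e. 4096 bytes, which matches the count loaded into %ecx above.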
Of course, if you have an old x86 chip, that's your problem and you may
not get optimal performance, but who expected anything else from old
hardware?
(Hint: the above _really_ flies on a PPro. Intel seems to have optimized
it to do cache-line accesses. They did the right thing.)
Linus