Re: [RFC] x86-64: Use SSE for copy_page and clear_page

From: Andi Kleen
Date: Mon May 30 2005 - 14:45:08 EST


> The SSE clear page function is almost twice as fast as the kernel's
> current clear_page, while the copy_page implementation is roughly a
> third faster. This is likely due to the fact that SSE instructions
> can keep the 256 bit wide L2 cache bus at a higher utilisation than
> 64 bit movs are able to. Comments?

Any use of write combining is wrong here because it forces
the destination out of cache, so whoever touches the page next
takes the cache misses. That causes performance issues later on.
Believe me, we went through this years ago.

If you can code up a better function for P4 that does not use
write combining, I would be happy to add it. I never tuned the
functions for P4.

One simple experiment would be to just test if P4 likes the
simple rep ; movsq / rep ; stosq loops and enable them.
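
For reference, the string-instruction loops could be tried from user
space with something like the sketch below. This is only an
illustration, not the kernel's code: the names clear_page_rep,
copy_page_rep and PAGE_BYTES are made up here, a 4 KiB page is
assumed, and the inline asm assumes GCC or clang on x86-64 (other
targets fall back to memset/memcpy). Neither loop uses non-temporal
stores, so the destination stays in cache.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096 /* assumption: 4 KiB pages */

/* Zero one page with "rep stosq" (hypothetical helper name). */
static void clear_page_rep(void *page)
{
#if defined(__x86_64__) && defined(__GNUC__)
    unsigned long count = PAGE_BYTES / 8; /* number of 8-byte stores */

    /* RDI = destination, RCX = count, RAX = value to store (zero). */
    __asm__ volatile("rep stosq"
                     : "+D"(page), "+c"(count)
                     : "a"(0UL)
                     : "memory");
#else
    memset(page, 0, PAGE_BYTES); /* portable fallback */
#endif
}

/* Copy one page with "rep movsq" (hypothetical helper name). */
static void copy_page_rep(void *to, const void *from)
{
#if defined(__x86_64__) && defined(__GNUC__)
    unsigned long count = PAGE_BYTES / 8; /* number of 8-byte moves */

    /* RDI = destination, RSI = source, RCX = count. */
    __asm__ volatile("rep movsq"
                     : "+D"(to), "+S"(from), "+c"(count)
                     :
                     : "memory");
#else
    memcpy(to, from, PAGE_BYTES); /* portable fallback */
#endif
}
```

Timing these against the current functions on a P4, with the
destination page read afterwards, would show whether they are worth
enabling.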

-Andi