>It might also be your video chip, BTW. It may not handle 64-bit
>transfers (over the 32-bit PCI bus) very efficiently.
Hmmm... I would guess that the X video transfers take place in
user code in the X server, and are unaffected by the FPU memcpy patch.
I assume the patch affects the transfers between the X client and
server (with no SHM extension in use). I assume it affects lots of
non-X system calls, and that might affect the timing!
About the only thing it gets triggered for is page copies and the
like. That was by intent. So this is an interesting mystery.
I was thinking that the test against the 1K limit (etc.)
could be inlined into the caller, and, with luck, optimized away in
some cases. I haven't looked very hard at the code, but it seems to
me that without the FPU memcpy patch, __generic_memcpy_fromfs and
__generic_memcpy_tofs were inlined. With the patch, they are externed.
So, what I propose is more like:
__generic_memcpy_fromfs is only used when the compiler doesn't know
how big the copy is. I figure if the compiler really doesn't know
then there's no good reason to inline anything here. Maybe you've
stumbled across a bad case (something's doing register-sized memcpy's
and not using a constant to specify the size -- yuk).
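To illustrate the point about constant versus unknown sizes, here is a minimal userland sketch of that kind of dispatch. It assumes GCC's __builtin_constant_p; the names sketch_memcpy, copy_constant_size, copy_unknown_size and the two counters are hypothetical and exist only for this example. It is not the kernel's actual header code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counters so the chosen path can be observed. */
static int inline_copies;
static int outofline_copies;

/* Inline path: when n is a compile-time constant the compiler can
 * expand the copy into a handful of moves at the call site. */
static inline void *copy_constant_size(void *to, const void *from, size_t n)
{
    assert(to != NULL && from != NULL);
    inline_copies++;
    return memcpy(to, from, n);
}

/* Out-of-line path: used when the size is only known at run time,
 * where inlining buys little and costs code size. */
static void *copy_unknown_size(void *to, const void *from, size_t n)
{
    assert(to != NULL && from != NULL);
    outofline_copies++;
    return memcpy(to, from, n);
}

/* Pick the path at compile time; __builtin_constant_p does not
 * evaluate its argument, so n is evaluated only once. */
#define sketch_memcpy(to, from, n)                  \
    (__builtin_constant_p(n)                        \
        ? copy_constant_size((to), (from), (n))     \
        : copy_unknown_size((to), (from), (n)))
```

A call such as sketch_memcpy(&dst, &src, sizeof src) takes the inline path, while passing a size read from a variable at run time takes the out-of-line one. A caller doing register-sized copies with a non-constant size would land on the out-of-line path every time, which is the bad case described above.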
+inline void
+__generic_memcpy_fromfs(void *to, const void *from, size_t bytes)
+{
+        if (bytes == 0)
+                goto out;
+        if ((bytes >= 1024) && ALIGNED(to, 8) && ALIGNED(from, 8) && ALIGNED(bytes,256))
+                ___xcopy_fromfs (to, from, bytes);
+        else
+                ___memcpy_fromfs(to, from, bytes);
+ out:
+        return;
+}
Similar changes would apply for __memcpy_g and __memcpy_tofs.
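The size and alignment test in the proposed function can be tried out in user space with a sketch like the one below. The ALIGNED definition here is an assumption (the original macro is not shown in this thread), sketch_copy and enum copy_path are hypothetical names, and plain memcpy stands in for both ___xcopy_fromfs and ___memcpy_fromfs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed definition: true when x is a multiple of n, n a power of two. */
#define ALIGNED(x, n) ((((uintptr_t)(x)) & ((uintptr_t)(n) - 1)) == 0)

/* Hypothetical return value reporting which path was taken. */
enum copy_path { PATH_NONE, PATH_PLAIN, PATH_BLOCK };

static enum copy_path sketch_copy(void *to, const void *from, size_t bytes)
{
    assert(to != NULL && from != NULL);

    if (bytes == 0)
        return PATH_NONE;

    /* Same condition as the proposed patch: at least 1K, both pointers
     * 8-byte aligned, and the length a multiple of 256. */
    if ((bytes >= 1024) && ALIGNED(to, 8) && ALIGNED(from, 8) &&
        ALIGNED(bytes, 256)) {
        memcpy(to, from, bytes);    /* stand-in for ___xcopy_fromfs */
        return PATH_BLOCK;
    }

    memcpy(to, from, bytes);        /* stand-in for ___memcpy_fromfs */
    return PATH_PLAIN;
}
```

With this test, only large, well-aligned copies such as whole pages reach the block path; a short copy, a misaligned pointer, or a length that is not a multiple of 256 falls through to the ordinary one.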
Even so, the overhead of the inline code causes undesirable kernel
bloat.
I found that the kernel bloat just from the standard inline memcpy was
substantial (something like 20K, I think), and the performance gain
was not measurable.
--
Robert Krawitz <rlk@tiac.net>          http://www.tiac.net/users/rlk/

Member of the League for Programming Freedom -- mail lpf@uunet.uu.net
Tall Clubs International -- tci-request@aptinc.com or 1-800-521-2512