Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs()optimization for 8x* fonts

From: Knut Petersen
Date: Wed Aug 31 2005 - 14:17:00 EST


Hi Roman!

+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, +
u32 s_pitch, u32 height)
+{
+ int i, j;
+
+ if (likely(s_pitch==1))
+ for(i=0; i < height; i++)
+ dst[d_pitch*i] = src[i];



I added the multiply back because gcc (v. 3.3.4) does generate the fastest code
if I write it this way. I compiled, inspected the generated assembly and benchmarked
about a dozend variations of the code (benchmark as previously described).

The special case for s_pitch == 1 saves about 10 ms system time (770 ms -> 760 ms)

The special case for s_pitch == 2 saves about 270 ms system time (2120 -> 1850ms)
with a 16x30 font.

The third case is for even bigger fonts ... I believe that it will not often be used but
something like that must be present.

You have now 3 slightly different variants of the same, which isn't really an improvement. In my example I showed you how to generate the first and last version from the same source.

The first version will only be generated when gcc can be sure that s_pitch is 1.
Therefore you had to explicitly call __fb_pad_aligned_buffer with that value:

if (likely(idx == 1))
__fb_pad_aligned_buffer(dst, pitch, src, 1, image.height);
else
fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);

With the version I propose it´s enough to write

__fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);

instead, and you will get good performance for all cases. If the value of
idx/s_pitch is know at compile time, the compiler can and will ignore the
other cases.

If you also want to optimize for other sizes, you might want to always inline the function, if the function call overhead is the largest part anyway, the special case for 2 bytes might not be needed anymore.


fb_pad_aligned_buffer() is usefull to save some space in cases like softcursor.
It´s also used by some drivers (nvidia and riva), but the authors of those drivers
have to decide if they prefer the inlined version or the version fixed in fbmem.

BTW this version saves another condition:

static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
{
int i, j;

d_pitch -= s_pitch;
i = height;
do {
/* s_pitch is a few bytes at the most, memcpy is suboptimal */
j = s_pitch;
do
*dst++ = *src++;
while (--j > 0);
dst += d_pitch;
} while (--i > 0);
}

I tested that code, together with the followig code in but_putcs():

if(idx==1)
__fb_pad_aligned_buffer(dst, pitch, src, 1, image.height);
else if (idx==2)
__fb_pad_aligned_buffer(dst, pitch, src, 2, image.height);
else
fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
dst += width;

It´s as fast/slow as your previous version, the measurements are almost identical.

Let´s summarize:

Your version of __fb_pad_aligned_buffer looks much better, but it needs not so nice
conditionals when used. My version looks bad, but it is easier to use and it is
_faster_.

cu,
knut



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/