Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats

From: Igor Matheus Andrade Torrente
Date: Tue Oct 19 2021 - 17:11:04 EST


Hi Pekka,

On 10/19/21 5:05 AM, Pekka Paalanen wrote:
On Mon, 18 Oct 2021 16:26:06 -0300
Igor Matheus Andrade Torrente <igormtorrente@xxxxxxxxx> wrote:

Hi Pekka,

On 10/18/21 5:30 AM, Pekka Paalanen wrote:
On Tue, 5 Oct 2021 17:16:37 -0300
Igor Matheus Andrade Torrente <igormtorrente@xxxxxxxxx> wrote:
Currently the blend function only accepts XRGB_8888 and ARGB_8888
as color input formats.

This patch refactors all the functions related to the plane composition
to overcome this limitation.

Now the blend function receives a format handler for each plane and a
blend function pointer. It will take two ARGB_16161616 pixels, one
from each handler, and will use the blend function to calculate and
store the final color in the output buffer.

These format handlers will receive the `vkms_composer` and a pair of
coordinates, and they should return the respective pixel in the
ARGB_16161616 format.

The blend function will receive two ARGB_16161616 pixels, x, y, and
the vkms_composer of the output buffer. It should perform the blend
operation and store the result in the aforementioned ARGB_16161616
format.
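
In function-pointer terms, the shapes described above look roughly like
this (the typedef names are only illustrative; the signatures match the
handler and conversion functions later in the patch):

typedef u64 (*format_handler_t)(struct vkms_composer *composer,
				int x, int y);

typedef void (*blend_func_t)(u64 argb_src1, u64 argb_src2, int x, int y,
			     struct vkms_composer *dst_composer);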

Hi,

here are some drive-by comments which you are free to take or leave.

To find more efficient ways to do this, you might want to research the
Pixman library. It's MIT licensed.

IIRC, the elementary operations there operate on pixel lines (pointer +
length), not individual pixels (image, x, y). The input conversion
function reads and converts a whole line at a time. The blending
function blends two lines and writes the output back into one of the
lines. The output conversion function takes a line and converts it into
the destination buffer. That way the elementary operations do not need
to take stride into account, and blending does not need to deal with
varying memory alignment. The inner loops involve far fewer function
calls and less state, probably.
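
As a rough sketch of that per-row flow (all helper names here are
hypothetical, just to show the shape, not Pixman's or VKMS's actual
API):

#define MAX_LINE_PX 2048	/* hypothetical cap on line length */

static void compose_rows(struct vkms_composer *src,
			 struct vkms_composer *dst,
			 int width, int height)
{
	/* in real code these would be allocated, not on the stack */
	u64 src_line[MAX_LINE_PX];	/* one row in ARGB16161616 */
	u64 dst_line[MAX_LINE_PX];
	int y;

	for (y = 0; y < height; y++) {
		read_line(src, y, src_line, width);	/* convert in */
		read_line(dst, y, dst_line, width);
		/* blend two lines, result goes back into dst_line */
		blend_line(src_line, dst_line, width);
		write_line(dst, y, dst_line, width);	/* convert out */
	}
}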

I was doing some rudimentary profiling, and I noticed that the code
spends most of the time with the indirect system call overhead and not
the actual computation. This can definitely help with it.

Hi,

I suppose you mean function (pointer) call, not system call?

Yes, I misspelled it.


Really good that you have already profiled things. All optimisations
should be guided by profiling, otherwise they are just guesses (and I
got lucky this time I guess).

Pixman may not do exactly this, but it does something very similar.
Pixman also has multiple different levels of optimisations, which may
not be necessary for VKMS.

It's a trade-off between speed and temporary memory consumed. You need
temporary buffers for two lines, but not more (not a whole image in
full intermediate precision). Further optimisation could determine
whether to process whole image rows as lines, or split a row into
multiple lines to stay within CPU cache size.

Sorry, I didn't understand the idea of the last sentence.

If an image is very wide, a single row could still be relatively large
in size (bytes). If it is large enough that the working set falls out
of a faster CPU cache into a slower CPU cache (or worse yet, into RAM
accesses), performance may suffer and become memory bandwidth limited
rather than ALU rate limited. Theoretically that can be worked around
by limiting the maximum size of a line, and splitting an image row into
multiple lines.
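
Illustrating the idea (a hypothetical sketch; as noted below, this is
nothing to implement before profiling justifies it):

static void compose_row_chunked(struct vkms_composer *src,
				struct vkms_composer *dst,
				int y, int width)
{
	int x;

	/* split one image row into cache-sized line segments,
	 * reusing MAX_LINE_PX from the earlier sketch */
	for (x = 0; x < width; x += MAX_LINE_PX) {
		int len = min(width - x, MAX_LINE_PX);

		compose_line_segment(src, dst, x, y, len); /* hypothetical */
	}
}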

Got it now, thanks.


However, this is an optimisation one probably should not do until there
is performance profiling data showing that it actually is useful. The
optimal line size limit depends on the CPU model as well. So it's a bit
far out, but something you could keep in mind just in case.

OK. For now I will not deal with this complexity, and if necessary I
will revisit it later.


Since it seems you are blending multiple planes into a single
destination which is assumed to be opaque, you might also be able to do
the multiple blends without convert-writing and read-converting to/from
the destination between every plane. I think that might be possible to
architect on top of the per-line operations still.

I tried it. But I don't know how to do this without it looking like a mess.

I don't think it changes anything, but I forgot to mention that I tried
it with blending per pixel and not per line.


Dedicate one of the temporary line buffers for the destination, and
instead of writing it out and loading it back for each input plane,
leave it in place over all planes and write it out just once at the end.

I do expect more state tracking needed. You need to iterate over the
list of planes for each output row, extract only the used part of an
input plane's buffer into the other temporary line buffer, and offset
the destination line buffer and length to match when passing those into
a blending function.
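
Something like this, roughly (an untested sketch with hypothetical
names, not a worked-out implementation):

static void compose_row(struct vkms_composer *out,
			struct vkms_crtc_state *state,
			int y, int width)
{
	/* again, real code would allocate these line buffers */
	u64 dst_line[MAX_LINE_PX];	/* stays resident across planes */
	u64 src_line[MAX_LINE_PX];
	int i;

	read_line(out, y, dst_line, width);	/* load destination once */

	for (i = 0; i < state->num_active_planes; i++) {
		struct vkms_composer *plane = state->active_composer[i];
		int x1, len;

		/* skip planes that do not intersect this output row */
		if (!plane_covers_row(plane, y, &x1, &len))
			continue;

		/* extract only the overlapping part of the plane... */
		read_line_at(plane, y, x1, src_line, len);

		/* ...and blend it at the matching destination offset */
		blend_line(src_line, dst_line + x1, len);
	}

	write_line(out, y, dst_line, width);	/* write out once */
}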

It's exactly this state tracking that was a mess when I was trying to
implement something similar. But this is one more thing that I will
probably have to revisit later.


It's not an obvious win but a trade-off, so profiling is again needed.

Btw. the use of temporary line buffers should also help with
implementing scaling. You could do the filtering during reading of the
input buffer. If the filter is not nearest, meaning you need to access
more than one input pixel per pixel-for-blending, there are a few ways
to go about that. You could do the filtering during the input buffer
reading, or you could load two input buffer rows into temporary line
buffers and do filtering as a separate step into yet another line
buffer. As the composition advances from top to bottom, only one of the
input buffer rows will change at a time (during up-scaling) so you can
avoid re-loading a row by swapping the line buffers.

This is getting ahead of things, so don't worry about scaling or
filtering yet. The first thing is to see if you can make the line
buffer approach give you a significant speed-up.

Does Pixman do something similar?

No, a Pixman composition operation has only three images: source,
mask, and destination. It can handle all of those simultaneously, but
since there is no "multi-blending" API, it doesn't need to take care of
more.

IIRC, Pixman also has a form of optimised operations that do blending
and converting to destination in the same pass. The advantage of that
is that blending can work with less precision when it knows what
precision the output will be converted to and it saves some bandwidth
by not needing to write-after-blending and read-for-conversion
intermediate precision values. The disadvantage is that the number of
needed specialised blending functions explodes by the number of
possible destination formats. Pixman indeed makes those specialised
functions optional, falling back to more generic C code. I would hope
that VKMS does not need to go this far in optimisations though.

This should be plenty fast indeed. Maybe worth it for formats that are
extremely common.


Signed-off-by: Igor Matheus Andrade Torrente <igormtorrente@xxxxxxxxx>
---
drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++-------------
drivers/gpu/drm/vkms/vkms_formats.h | 125 ++++++++++++
2 files changed, 271 insertions(+), 129 deletions(-)
create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h

...
+
+u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
+{
+ u8 *pixel_addr = packed_pixels_addr(composer, x, y);
+
+ /*
+ * Organizes the channels in their respective positions and converts
+ * the 8 bits channel to 16.
+ * The 257 is the "conversion ratio". This number is obtained by the
+ * (2^16 - 1) / (2^8 - 1) division, which, in this case, tries to get
+ * the best color value in a color space with more possibilities.

Pixel format, not color space.

+ * And a similar idea applies to other RGB color conversions.
+ */
+ return ((u64)pixel_addr[3] * 257) << 48 |
+ ((u64)pixel_addr[2] * 257) << 32 |
+ ((u64)pixel_addr[1] * 257) << 16 |
+ ((u64)pixel_addr[0] * 257);

I wonder if the compiler is smart enough to choose between "mul 257"
and "(v << 8) | v" operations... but that's probably totally drowning
under the overhead of per (x,y) looping.

I disassembled the code to check it. And it looks like the compiler is
replacing the multiplication with shifts and additions.

ARGB8888_to_ARGB16161616:
0xffffffff816ad660 <+0>: imul 0x12c(%rdi),%edx
0xffffffff816ad667 <+7>: imul 0x130(%rdi),%esi
0xffffffff816ad66e <+14>: add %edx,%esi
0xffffffff816ad670 <+16>: add 0x128(%rdi),%esi
0xffffffff816ad676 <+22>: movslq %esi,%rax
0xffffffff816ad679 <+25>: add 0xe8(%rdi),%rax
0xffffffff816ad680 <+32>: movzbl 0x3(%rax),%ecx
0xffffffff816ad684 <+36>: movzbl 0x2(%rax),%esi
0xffffffff816ad688 <+40>: mov %rcx,%rdx
0xffffffff816ad68b <+43>: shl $0x8,%rdx
0xffffffff816ad68f <+47>: add %rcx,%rdx
0xffffffff816ad692 <+50>: mov %rsi,%rcx
0xffffffff816ad695 <+53>: shl $0x8,%rcx
0xffffffff816ad699 <+57>: shl $0x30,%rdx
0xffffffff816ad69d <+61>: add %rsi,%rcx
0xffffffff816ad6a0 <+64>: movzbl (%rax),%esi
0xffffffff816ad6a3 <+67>: shl $0x20,%rcx
0xffffffff816ad6a7 <+71>: or %rcx,%rdx
0xffffffff816ad6aa <+74>: mov %rsi,%rcx
0xffffffff816ad6ad <+77>: shl $0x8,%rcx
0xffffffff816ad6b1 <+81>: add %rsi,%rcx
0xffffffff816ad6b4 <+84>: or %rcx,%rdx
0xffffffff816ad6b7 <+87>: movzbl 0x1(%rax),%ecx
0xffffffff816ad6bb <+91>: mov %rcx,%rax
0xffffffff816ad6be <+94>: shl $0x8,%rax
0xffffffff816ad6c2 <+98>: add %rcx,%rax
0xffffffff816ad6c5 <+101>: shl $0x10,%rax
0xffffffff816ad6c9 <+105>: or %rdx,%rax
0xffffffff816ad6cc <+108>: ret

Nice!

+}
+
+u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
+{
+ u8 *pixel_addr = packed_pixels_addr(composer, x, y);
+
+ /*
+ * The same as the ARGB8888 but with the alpha channel set to the
+ * maximum possible value.
+ */
+ return 0xffffllu << 48 |
+ ((u64)pixel_addr[2] * 257) << 32 |
+ ((u64)pixel_addr[1] * 257) << 16 |
+ ((u64)pixel_addr[0] * 257);
+}
+
+u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
+{
+ __le64 *pixel_addr = packed_pixels_addr(composer, x, y);
+
+ /*
+ * Because the format byte order is little-endian and this code
+ * needs to run on big-endian machines too, we need to convert
+ * the byte order from little-endian to the CPU's native byte order.
+ */
+ return le64_to_cpu(*pixel_addr);
+}
+
+/*
+ * The following functions are used as blend operations. But unlike the
+ * `alpha_blend`, these functions take an ARGB16161616 pixel from the
+ * source, convert it to a specific format, and store it in the destination.
+ *
+ * They are used in the `compose_active_planes` and `write_wb_buffer` to
+ * copy and convert one pixel from/to the output buffer to/from
+ * another buffer (e.g. writeback buffer, primary plane buffer).
+ */
+
+void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
+ struct vkms_composer *dst_composer)
+{
+ u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
+
+ /*
+ * This sequence below is important because the format's byte order is
+ * in little-endian. In the case of the ARGB8888 the memory is
+ * organized this way:
+ *
+ * | Addr | = blue channel
+ * | Addr + 1 | = green channel
+ * | Addr + 2 | = Red channel
+ * | Addr + 3 | = Alpha channel
+ */
+ pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
+ pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
+ pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
+ pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);

This could be potentially expensive if the compiler ends up using an
actual div instruction.

Yes, I'm using DIV_ROUND_UP because I couldn't figure out how the
"faster div by 255" trick works in order to adapt it to 16 bits.

But does the compiler actually do a div instruction? If not, then no
worries. If it does, then maybe something to look into, *if* this shows
up in profiling.
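
In case it ever does: since 257 == (1 << 8) + 1, I believe the identity
floor(v / 257) == (v - (v >> 8)) >> 8 holds for v up to 0xffff + 256,
which would give a shift-only form of DIV_ROUND_UP(v, 257) for 16-bit
channels. An untested sketch, worth double-checking:

static inline u8 argb_u16_to_u8(u32 v)
{
	v += 256;	/* + (divisor - 1), i.e. round up */

	return (v - (v >> 8)) >> 8;
}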

GCC isn't issuing any div instruction here.

convert_to_ARGB8888:
0xffffffff816ad770 <+0>: imul 0x12c(%rcx),%edx
0xffffffff816ad777 <+7>: mov %rcx,%rax
0xffffffff816ad77a <+10>: imul 0x130(%rcx),%esi
0xffffffff816ad781 <+17>: add %edx,%esi
0xffffffff816ad783 <+19>: movzwl %di,%edx
0xffffffff816ad786 <+22>: add 0x128(%rcx),%esi
0xffffffff816ad78c <+28>: add $0x100,%rdx
0xffffffff816ad793 <+35>: movslq %esi,%rcx
0xffffffff816ad796 <+38>: movabs $0xff00ff00ff00ff01,%rsi
0xffffffff816ad7a0 <+48>: add 0xe8(%rax),%rcx
0xffffffff816ad7a7 <+55>: mov %rdx,%rax
0xffffffff816ad7aa <+58>: mul %rsi
0xffffffff816ad7ad <+61>: shr $0x8,%rdx
0xffffffff816ad7b1 <+65>: mov %dl,(%rcx)
0xffffffff816ad7b3 <+67>: mov %rdi,%rdx
0xffffffff816ad7b6 <+70>: shr $0x10,%rdx
0xffffffff816ad7ba <+74>: movzwl %dx,%edx
0xffffffff816ad7bd <+77>: add $0x100,%rdx
0xffffffff816ad7c4 <+84>: mov %rdx,%rax
0xffffffff816ad7c7 <+87>: mul %rsi
0xffffffff816ad7ca <+90>: shr $0x8,%rdx
0xffffffff816ad7ce <+94>: mov %dl,0x1(%rcx)
0xffffffff816ad7d1 <+97>: mov %rdi,%rdx
0xffffffff816ad7d4 <+100>: shr $0x30,%rdi
0xffffffff816ad7d8 <+104>: shr $0x20,%rdx
0xffffffff816ad7dc <+108>: movzwl %dx,%edx
0xffffffff816ad7df <+111>: add $0x100,%rdx
0xffffffff816ad7e6 <+118>: mov %rdx,%rax
0xffffffff816ad7e9 <+121>: mul %rsi
0xffffffff816ad7ec <+124>: shr $0x8,%rdx
0xffffffff816ad7f0 <+128>: mov %dl,0x2(%rcx)
0xffffffff816ad7f3 <+131>: lea 0x100(%rdi),%rdx
0xffffffff816ad7fa <+138>: mov %rdx,%rax
0xffffffff816ad7fd <+141>: mul %rsi
0xffffffff816ad800 <+144>: shr $0x8,%rdx
0xffffffff816ad804 <+148>: mov %dl,0x3(%rcx)
0xffffffff816ad807 <+151>: ret



Btw. this would be shorter to write as

(argb_src1 >> 16) & 0xffff

etc.

I will use it in the V2. Thanks.


Thanks,
pq
+}
+
+void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
+ struct vkms_composer *dst_composer)
+{
+ u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
+
+ pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
+ pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
+ pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
+ pixel_addr[3] = 0xff;
+}
+
+void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
+ struct vkms_composer *dst_composer)
+{
+ __le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
+
+ *pixel_addr = cpu_to_le64(argb_src1);
+}
+
+#endif /* _VKMS_FORMATS_H_ */