Re: [PATCH V3 3/5] drm/vkms: Decouple crc operations from composer

From: Simon Ser
Date: Thu Jul 11 2019 - 04:23:16 EST


On Thursday, July 11, 2019 11:19 AM, Daniel Vetter <daniel@xxxxxxxx> wrote:
> Aside: This all kinda doesn't go in the right direction for
> high-performance composing, so I guess I need to get started with typing
> up what that should look like.

Some related logs from IRC:

2019-07-10 19:42:49 danvet ickle, is there an idiot guide to reasonable fast blending/composing with the cpu?
2019-07-10 19:43:13 danvet for vkms, so maintainability is high on the wishlist, but it needs to be somewhat fast to be able to keep up
2019-07-10 19:44:02 ickle pixman for rectangles
2019-07-10 19:45:00 ickle if you want reasonably fast, you want simd
2019-07-10 19:45:16 ickle so better to use some prebaked library
2019-07-10 19:45:38 ickle but within vkms?
2019-07-10 19:45:48 ickle or just for igt/vkms?
2019-07-10 19:46:40 ickle within vkms for writeback blending I guess
2019-07-10 19:51:38 ickle http://paste.debian.net/1091097/
2019-07-10 19:53:08 ickle http://paste.debian.net/1091098/
2019-07-10 19:54:36 emersion danvet: what are your plans for the compositor refactoring?
2019-07-10 19:55:21 emersion are we still really using macros instead of functions
2019-07-10 19:56:44 emersion is this coming from pixman, ickle?
2019-07-10 19:57:23 ickle yes, take it as a reference on how to do abgr32 premultiplied alpha blending
2019-07-10 19:57:48 emersion yeah, that's a good idea
2019-07-10 19:57:49 ickle or argb32 actually
2019-07-10 19:58:01 emersion (asking for the source because license)
2019-07-10 19:58:04 ickle MIT
2019-07-10 19:58:07 emersion nice
2019-07-10 20:01:34 danvet ickle, yeah vkms in the kernel
2019-07-10 20:02:41 danvet also might need more than 8 bits ...
2019-07-10 20:02:57 danvet and kinda hoped I could just tell gcc to simdify it for me
2019-07-10 20:03:23 danvet emersion, compositor refactoring
2019-07-10 20:03:46 danvet ickle, higher level I figured something like a fetch fifo in a standard format
2019-07-10 20:04:02 danvet with some drm_format->standard format conversion tools
2019-07-10 20:04:08 danvet and then one blender
2019-07-10 20:04:39 danvet and then either add that to the crc and toss it (again maybe scanline-by-scanline, or whatever fits reasonable into l1$ all together)
2019-07-10 20:04:44 danvet or dump it into writeback
2019-07-10 20:08:25 danvet https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html <- this stuff essentially, using generics
2019-07-10 20:08:38 danvet *generic intrinsics
2019-07-10 20:08:49 danvet or is that going to be real awful?
2019-07-10 20:09:35 ickle shouldn't be required for _basic_ blending
2019-07-10 20:09:52 danvet yeah I think all we want is premultiplied alpha
2019-07-10 20:09:53 ickle if all you need is an over operator, then gcc should be pretty good
2019-07-10 20:09:59 danvet maybe some yuv->rgb
2019-07-10 20:10:19 danvet expanding from whatever silly format we have to the right vector
2019-07-10 20:10:33 emersion what would be your universal format?
2019-07-10 20:10:46 danvet a16r16b16g16
2019-07-10 20:10:50 danvet except if gcc barfs on that
2019-07-10 20:11:02 danvet or maybe go all in on 4x float :-)
2019-07-10 20:11:10 emersion eh
2019-07-10 20:11:29 emersion fp16 would work, but would also mean rounding errors, probably?
2019-07-10 20:11:35 danvet uint16 is not going to be awesome for hdr, but good enough for everything else
2019-07-10 20:11:42 danvet no cpu has fp16
2019-07-10 20:11:52 emersion ah, fp32
2019-07-10 20:11:57 krh wait for avx1024
2019-07-10 20:12:01 emersion seems kind of overkill
2019-07-10 20:12:01 danvet fp32 is probably fastest option we can get on common hw
2019-07-10 20:12:12 danvet krh, very much aiming for good enough here
2019-07-10 20:12:26 ickle one plan is to sneak ksim into the kernel as generic gpu-on-x86
2019-07-10 20:12:52 krh can the kernel use avx2?
2019-07-10 20:13:05 danvet well, all stuff I'd need to figure out
2019-07-10 20:13:07 danvet I hope so
2019-07-10 20:13:12 ickle it can
2019-07-10 20:13:19 ickle easier than mmx
2019-07-10 20:13:49 emersion when you say gcc barfs on a16r16b16g16: why?
2019-07-10 20:13:57 danvet kernel_fpu_begin/end + telling gcc to optimize the crap out of the file with the blending functions
2019-07-10 20:14:11 emersion uint64 too hard for gcc to optimize?
2019-07-10 20:14:14 danvet emersion, I haven't checked, but if it generates silly code then might be better to go with fp32
2019-07-10 20:14:28 vsyrjala the compiler always generates silly code
2019-07-10 20:14:30 danvet ideally it should all boil down to sse/avx
2019-07-10 20:14:32 emersion ahah
2019-07-10 20:14:51 danvet and ideally all with generic intrinsics so the arm folks don't freak out
2019-07-10 20:15:23 emersion +1 for generic intrinsics
2019-07-10 20:15:42 danvet the conversion from uint to fp32 might be hilarious
2019-07-10 20:15:50 emersion yeah, probably
2019-07-10 20:15:52 danvet perhaps the one place where we want to use an sse or avx intrinsic
2019-07-10 20:16:19 danvet especially if we convert to simd16 instead of something like 4x4
2019-07-10 20:16:22 emersion we should probably do some little experiments before doing anything
2019-07-10 20:16:24 danvet simd4x4
2019-07-10 20:16:31 danvet yeah
2019-07-10 20:16:41 danvet that's why I'm asking here, since I have roughly 0 clue about this
2019-07-10 20:17:48 danvet I don't think we ever need a dot or anything like that, so plain simd is probably best
2019-07-10 20:18:07 danvet except the input is usually 4x or 3x vectors
2019-07-10 20:18:07 emersion a dot?
2019-07-10 20:18:12 danvet dot product
2019-07-10 20:18:14 emersion ah
2019-07-10 20:18:15 danvet for vertex shaders
2019-07-10 20:18:20 emersion yeah, probably not
2019-07-10 20:18:21 bwidawsk danvet: btw, I think "generic intrinsic" is an ARM thing
2019-07-10 20:18:29 bwidawsk I think everywhere else, they just say intrinsic
2019-07-10 20:18:41 bwidawsk at least, I've never heard the generic prefix other than ARM compiler
2019-07-10 20:19:19 vsyrjala just make the max supported resolution 8x8 or something and speed shouldn't be a huge issue
2019-07-10 20:19:30 emersion i think he meant generic intrinsic vs. sse/avx/whatever
2019-07-10 20:19:49 bwidawsk emersion: yes, I figured out what he meant, I just mentioned it because it was a source of confusion
2019-07-10 20:19:52 emersion vsyrjala, helpful as always :)
2019-07-10 20:20:32 danvet oh gcc has all the casting implementing in the intrinsics too
2019-07-10 20:20:50 danvet bwidawsk, gcc manpage also calls them generic intrinsics
2019-07-10 20:21:25 bwidawsk danvet: not my gcc manpage
2019-07-10 20:21:40 danvet well "generic vector operations"
2019-07-10 20:21:46 danvet ^^ that what you meant?
2019-07-10 20:22:03 danvet vs "machine-specific vector intrinsics"
2019-07-10 20:22:38 bwidawsk I thought the generic intrinsic term came from ARM's proprietary compiler
2019-07-10 20:22:40 bwidawsk but I might be wrong
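
To make the idea in the log concrete, here is a minimal user-space sketch of a premultiplied-alpha "over" blend written with GCC's generic vector operations. It is not kernel code and is not taken from the pastes or from pixman. The names v4u32, unpack_argb8888, pack_argb8888 and blend_over are hypothetical, and it assumes the 16-bit-per-channel intermediate format floated above, stored in 32-bit lanes so the product does not overflow. Whether gcc turns this into good sse/avx code is exactly the kind of thing that would need the small experiments mentioned in the log.

```c
#include <stdint.h>

/* Four 32-bit lanes in A, R, G, B order. Each lane carries a 16-bit
 * channel value; the extra bits give headroom for the product in
 * blend_over(). This is an illustrative layout, not a vkms type. */
typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* Expand one premultiplied ARGB8888 pixel to 16 bits per channel.
 * Multiplying by 257 maps 0x00..0xff onto 0x0000..0xffff exactly. */
static inline v4u32 unpack_argb8888(uint32_t px)
{
	v4u32 v = { px >> 24, (px >> 16) & 0xff, (px >> 8) & 0xff, px & 0xff };

	return v * 257u;
}

/* Reduce back to ARGB8888, rounding to the nearest 8-bit value. */
static inline uint32_t pack_argb8888(v4u32 v)
{
	v4u32 b = (v + 128u) / 257u;

	return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3];
}

/* Porter-Duff "over" with premultiplied alpha:
 *   out = src + dst * (1 - src_alpha)
 * applied to all four lanes, alpha included. With valid premultiplied
 * input (every channel <= alpha) the result stays within 16 bits. */
static inline v4u32 blend_over(v4u32 src, v4u32 dst)
{
	uint32_t inv_a = 65535u - src[0];

	return src + (dst * inv_a + 32767u) / 65535u;
}
```

For example, pack_argb8888(blend_over(unpack_argb8888(src), unpack_argb8888(dst))) blends one source pixel over one destination pixel. A real composer would more likely process a whole scanline per call and keep the data in the wide format until the crc or writeback step.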