Hi,

[...]

> What I am thinking is: who would use a basic RC algorithm in the
> kernel?
In cable streaming notably, the RC job is to monitor the amount of bits over a
period of time (the window). This window is defined by the streaming hardware's
buffering capabilities. The best thing at this point is to start reading
through HRD specifications and open source rate control implementations
(notably x264).

I think overall we can live with adding hints where needed, and if the GOP
information is an appropriate hint, then we can just reuse the existing
control.
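
To make the windowing idea concrete, here is a minimal leaky-bucket sketch
(all names made up, far simpler than x264's ratecontrol.c, just to illustrate
the bits-over-a-window accounting):

    #include <stdint.h>

    /*
     * Minimal sketch in the spirit of the HRD model: the encoder adds
     * bits as frames are produced, the channel drains a fixed allowance
     * per frame, and QP is nudged to keep the bucket level inside the
     * window the hardware can buffer.
     */
    struct rc_bucket {
        int64_t size;     /* bucket capacity, in bits (the window) */
        int64_t fullness; /* current level, in bits */
        int64_t drain;    /* per-frame allowance: bitrate / framerate */
    };

    static int rc_update(struct rc_bucket *b, int64_t frame_bits, int qp)
    {
        b->fullness += frame_bits - b->drain;
        if (b->fullness < 0)
            b->fullness = 0;

        if (b->fullness > 3 * b->size / 4)
            qp++; /* close to overflow, compress harder */
        else if (b->fullness < b->size / 4)
            qp--; /* headroom available, spend it on quality */

        return qp; /* clamping to codec QP limits omitted */
    }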
> Why do we still care about GOP here? Hardware has no idea about GOP
> at all. Although in codecs like HEVC the NALU headers of IDR and
> intra pictures differ, there is no difference in the hardware coding
> configuration. The NALU header is usually generated by userspace,
> and whether a future encode regards the current encoded picture as
> an IDR is completely decided by userspace.
The discussion was around having a basic RC algorithm in the kernel driver,
possibly making use of hardware-specific features without actually exposing it
all to userspace. So assuming we do that:

> It sounds like a fixed-bitrate RC. Then would this RC algorithm be
> in charge of selecting the reference frames?
Paul's concern is that for best results, an RC algorithm could use knowledge
of keyframe placement to preserve bucket space (possibly using the last
keyframe size as a hint). Exposing the GOP structure in some form allows
"prediction", so the adaptation can look ahead at the future budget without
introducing latency. There is an alternative, which is to require ahead-of-time
queuing of encode requests. But this does introduce latency since, the way it
works in V4L2 today, we need the picture to be filled by the time we request
an encode.
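
As an illustration of what that keyframe awareness buys (again made-up names,
building on the bucket sketch above): knowing the distance to the next
keyframe lets the RC hold back the expected keyframe cost and spread only the
remainder over the inter frames.

    static int64_t rc_frame_budget(const struct rc_bucket *b,
                                   uint32_t frames_to_keyframe,
                                   int64_t last_keyframe_bits)
    {
        int64_t headroom = b->size - b->fullness;

        /* The keyframe itself may spend the reserve we kept for it. */
        if (frames_to_keyframe == 0)
            return last_keyframe_bits;

        /* Keep the expected keyframe cost aside, share the rest. */
        headroom -= last_keyframe_bits;
        if (headroom < 0)
            headroom = 0;

        return b->drain + headroom / frames_to_keyframe;
    }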
> I don't think it would help. Fences are a thing for DRM/GPU, which
> has no queue.

Though, if we drop the GOP structure and favour this approach, the latency
could be regained later by introducing fence-based streaming. The technique
would be for a video source (like a capture driver) to pass dmabufs that
aren't filled yet, but have a companion fence. This would allow queuing
requests ahead of time, and all we need is enough pre-allocation to
accommodate the desired lookahead. The only issue is that perhaps this
violates the fundamental principle of "short term" delivery of fences. But
fences can also fail, I think, in case the capture was stopped.
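
To be clear about what that flow could look like (purely hypothetical, none
of these fields exist in the V4L2 uAPI today):

    /*
     * Hypothetical shape of an ahead-of-time encode submission. The
     * capture driver would export the picture dmabuf together with a
     * sync_file; the encoder driver starts the job when the fence
     * signals, or fails the request if the fence errors out (e.g. the
     * capture was stopped). Userspace never waits, it only queues.
     */
    struct pending_encode {
        int pic_dmabuf_fd; /* picture buffer, possibly not filled yet */
        int fence_fd;      /* sync_file, signals when the pixels land */
        int request_fd;    /* media request carrying the encode params */
    };

Queuing a handful of these would be enough to give a future-aware RC its
lookahead without adding application-side latency.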
> I think we should not restrict how userspace (the vendor) operates
> the hardware.

We can certainly move forward with this as a future solution, or just not
implement a future-aware RC algorithm, in order to avoid the huge task this
involves (and possibly patents?).
[...]
Of course, the subject is much more relevant when there are encoders with more
than one reference. But you are correct: what the commands do is allow
changing, adding or removing any reference from the list (random
modification), as long as the result fits within the codec constraints (like
the DPB size, notably). This is the only way one can implement temporal SVC
reference patterns, robust reference trees or RTP RPSI. Note that long-term
references also exist, and are less complex than these commands.
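
For readers not familiar with the temporal SVC case, here is the smallest
example of such a pattern (two temporal layers, helper names made up):

    #include <stdbool.h>
    #include <stdint.h>

    /* Even frames form the base layer (L0); odd frames sit in the
     * enhancement layer (L1) and are never referenced themselves. */
    static bool svc2_is_base(uint64_t n)
    {
        return (n & 1) == 0;
    }

    /* Every frame references the closest base-layer frame before it,
     * never the L1 frame in between; this is the "random" list edit a
     * plain previous-frame scheme cannot express. Frame 0 is the
     * keyframe, so callers must not ask for its reference. */
    static uint64_t svc2_ref(uint64_t n)
    {
        return (n - 1) & ~1ULL;
    }

    /* Because L1 frames are unreferenced, a receiver can drop them
     * and halve the frame rate without corrupting the base layer. */
    static bool svc2_droppable(uint64_t n)
    {
        return !svc2_is_base(n);
    }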
> If the userspace could manage the lifetime of the reconstruction
> buffers (assignment, referencing), we wouldn't need a command here.

Sorry if I created confusion, the commands are something specific to H.264
coding. They are a compressed form of the reference lists. This information is
coded in the slice header and enabled through
adaptive_ref_pic_marking_mode_flag.
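
For reference, these are the marking commands in question, as coded in
dec_ref_pic_marking() (see ITU-T H.264, section 7.4.3.3; the C layout below
is made up for illustration):

    #include <stdint.h>

    enum h264_mmco {
        MMCO_END = 0,               /* end of the command list */
        MMCO_UNREF_SHORT_TERM = 1,  /* drop one short-term reference */
        MMCO_UNREF_LONG_TERM = 2,   /* drop one long-term reference */
        MMCO_SHORT_TO_LONG = 3,     /* turn a short-term ref long-term */
        MMCO_SET_MAX_LONG_TERM = 4, /* cap the long-term index range */
        MMCO_UNREF_ALL = 5,         /* drop every reference */
        MMCO_CURRENT_TO_LONG = 6,   /* mark current frame long-term */
    };

    struct h264_mmco_op {
        enum h264_mmco op;
        uint32_t difference_of_pic_nums_minus1; /* ops 1 and 3 */
        uint32_t long_term_pic_num;             /* op 2 */
        uint32_t long_term_frame_idx;           /* ops 3 and 6 */
        uint32_t max_long_term_frame_idx_plus1; /* op 4 */
    };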
> I don't even think we should write the slice header into the CAPTURE
> buffer, which would cause a cache problem. Usually the slice header
> is written only when that slice data is copied out.

It was suggested so far to leave H.264 slice header writing to the driver.
This is motivated by the H.264 slice header not being byte-aligned in size,
which makes it hard to combine with the slice_data(). Also, some hardware
actually produces the slice_header. This needs actual hardware interface
analysis, because an H.264 slice header is worth nothing if it cannot instruct
the decoder how to maintain the desired reference state.
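
To illustrate why the missing byte alignment hurts (toy code, made-up names):
once the header ends mid-byte, every byte of the payload has to be shifted,
so the application cannot simply memcpy() the hardware's slice_data() behind
a header it wrote itself.

    #include <stddef.h>
    #include <stdint.h>

    /* Append nbits of src behind an arbitrary bit position in dst
     * (dst must be zero-initialized); both streams are MSB-first. */
    static void append_bits(uint8_t *dst, size_t *bitpos,
                            const uint8_t *src, size_t nbits)
    {
        for (size_t i = 0; i < nbits; i++, (*bitpos)++) {
            uint8_t bit = (src[i / 8] >> (7 - i % 8)) & 1;

            dst[*bitpos / 8] |= bit << (7 - *bitpos % 8);
        }
    }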
> H.264 and H.265 have byte_alignment() in the NALU. You don't need
> the skip-bits feature that can be found in the H1.

I think this aspect should probably not be generalized to all CODECs, since
the packing semantics can largely differ. When the codec header is indeed
byte-aligned, it can easily be separated out and combined by the application,
improving application flexibility and reducing kernel API complexity.
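
In the byte-aligned case the combination really is trivial; userspace can
stitch the header and the hardware payload together without touching a single
bit, e.g. with plain scatter-gather I/O:

    #include <sys/types.h>
    #include <sys/uio.h>

    static ssize_t write_frame(int fd, const void *hdr, size_t hdr_len,
                               const void *payload, size_t payload_len)
    {
        struct iovec iov[2] = {
            { .iov_base = (void *)hdr,     .iov_len = hdr_len },
            { .iov_base = (void *)payload, .iov_len = payload_len },
        };

        return writev(fd, iov, 2);
    }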
> It is just a matter of how to design another request API control
> structure to select which buffers would be used for list0 and list1.

This raises a big question, and I never checked how this works with, let's
say, VA. Shall we let the driver resolve the changes into commands? (VP8 has
something similar, while VP9 and AV1 use refresh flags, which are trivial to
compute.) I believe I'll have to investigate this further.
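
If we went down that road, I would expect something shaped roughly like the
stateless decoder's reference lists (hypothetical sketch, no such control
exists today):

    #include <linux/types.h>

    #define HYP_REF_LIST_LEN 32 /* made-up bound, DPB-sized in practice */

    /* Hypothetical request control: entries index into a set of
     * userspace-managed reconstruction buffers. */
    struct hyp_h264_encode_reflists {
        __u8 ref_pic_list0[HYP_REF_LIST_LEN];
        __u8 ref_pic_list1[HYP_REF_LIST_LEN];
        __u8 num_ref_idx_l0_active_minus1;
        __u8 num_ref_idx_l1_active_minus1;
    };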
[...]
regards,
Nicolas