Re: [PATCH v3 2/2] media: docs-rst: Document memory-to-memory video encoder interface

From: Tomasz Figa
Date: Mon Apr 08 2019 - 05:35:19 EST


On Mon, Apr 8, 2019 at 4:43 PM Hans Verkuil <hverkuil@xxxxxxxxx> wrote:
>
> On 4/8/19 8:59 AM, Tomasz Figa wrote:
> > On Thu, Mar 21, 2019 at 7:11 PM Hans Verkuil <hverkuil@xxxxxxxxx> wrote:
> >>
> >> Hi Tomasz,
> >>
> >> A few more comments:
> >>
> >> On 1/24/19 11:04 AM, Tomasz Figa wrote:
> >>> Due to complexity of the video encoding process, the V4L2 drivers of
> >>> stateful encoder hardware require specific sequences of V4L2 API calls
> >>> to be followed. These include capability enumeration, initialization,
> >>> encoding, encode parameters change, drain and reset.
> >>>
> >>> Specifics of the above have been discussed during Media Workshops at
> >>> LinuxCon Europe 2012 in Barcelona and then later Embedded Linux
> >>> Conference Europe 2014 in Düsseldorf. The de facto Codec API that
> >>> originated at those events was later implemented by the drivers we already
> >>> have merged in mainline, such as s5p-mfc or coda.
> >>>
> >>> The only thing missing was the real specification included as a part of
> >>> Linux Media documentation. Fix it now and document the encoder part of
> >>> the Codec API.
> >>>
> >>> Signed-off-by: Tomasz Figa <tfiga@xxxxxxxxxxxx>
> >>> ---
> >>> Documentation/media/uapi/v4l/dev-encoder.rst | 586 ++++++++++++++++++
> >>> Documentation/media/uapi/v4l/dev-mem2mem.rst | 1 +
> >>> Documentation/media/uapi/v4l/pixfmt-v4l2.rst | 5 +
> >>> Documentation/media/uapi/v4l/v4l2.rst | 2 +
> >>> .../media/uapi/v4l/vidioc-encoder-cmd.rst | 38 +-
> >>> 5 files changed, 617 insertions(+), 15 deletions(-)
> >>> create mode 100644 Documentation/media/uapi/v4l/dev-encoder.rst
> >>>
> >>> diff --git a/Documentation/media/uapi/v4l/dev-encoder.rst b/Documentation/media/uapi/v4l/dev-encoder.rst
> >>> new file mode 100644
> >>> index 000000000000..fb8b05a132ee
> >>> --- /dev/null
> >>> +++ b/Documentation/media/uapi/v4l/dev-encoder.rst
> >>> @@ -0,0 +1,586 @@
> >>> +.. -*- coding: utf-8; mode: rst -*-
> >>> +
> >>> +.. _encoder:
> >>> +
> >>> +*************************************************
> >>> +Memory-to-memory Stateful Video Encoder Interface
> >>> +*************************************************
> >>> +
> >>> +A stateful video encoder takes raw video frames in display order and encodes
> >>> +them into a bitstream. It generates complete chunks of the bitstream, including
> >>> +all metadata, headers, etc. The resulting bitstream does not require any
> >>> +further post-processing by the client.
> >>> +
> >>> +Performing software stream processing, header generation etc. in the driver
> >>> +in order to support this interface is strongly discouraged. In case such
> >>> +operations are needed, use of the Stateless Video Encoder Interface (in
> >>> +development) is strongly advised.
> >>> +
> >>> +Conventions and notation used in this document
> >>> +==============================================
> >>> +
> >>> +1. The general V4L2 API rules apply if not specified in this document
> >>> + otherwise.
> >>> +
> >>> +2. The meaning of words "must", "may", "should", etc. is as per `RFC
> >>> + 2119 <https://tools.ietf.org/html/rfc2119>`_.
> >>> +
> >>> +3. All steps not marked "optional" are required.
> >>> +
> >>> +4. :c:func:`VIDIOC_G_EXT_CTRLS` and :c:func:`VIDIOC_S_EXT_CTRLS` may be used
> >>> + interchangeably with :c:func:`VIDIOC_G_CTRL` and :c:func:`VIDIOC_S_CTRL`,
> >>> + unless specified otherwise.
> >>> +
> >>> +5. Single-planar API (see :ref:`planar-apis`) and applicable structures may be
> >>> + used interchangeably with multi-planar API, unless specified otherwise,
> >>> + depending on decoder capabilities and following the general V4L2 guidelines.
> >>
> >> decoder -> encoder
> >>
> >
> > Ack.
> >
> >>> +
> >>> +6. i = [a..b]: sequence of integers from a to b, inclusive, e.g. i =
> >>> + [0..2]: i = 0, 1, 2.
> >>> +
> >>> +7. Given an ``OUTPUT`` buffer A, then A' represents a buffer on the ``CAPTURE``
> >>> + queue containing data that resulted from processing buffer A.
> >>> +
> >>> +Glossary
> >>> +========
> >>> +
> >>> +Refer to :ref:`decoder-glossary`.
> >>> +
> >>> +State machine
> >>> +=============
> >>> +
> >>> +.. kernel-render:: DOT
> >>> + :alt: DOT digraph of encoder state machine
> >>> + :caption: Encoder state machine
> >>> +
> >>> + digraph encoder_state_machine {
> >>> + node [shape = doublecircle, label="Encoding"] Encoding;
> >>> +
> >>> + node [shape = circle, label="Initialization"] Initialization;
> >>> + node [shape = circle, label="Stopped"] Stopped;
> >>> + node [shape = circle, label="Drain"] Drain;
> >>> + node [shape = circle, label="Reset"] Reset;
> >>> +
> >>> + node [shape = point]; qi
> >>> + qi -> Initialization [ label = "open()" ];
> >>> +
> >>> + Initialization -> Encoding [ label = "Both queues streaming" ];
> >>> +
> >>> + Encoding -> Drain [ label = "V4L2_ENC_CMD_STOP" ];
> >>> + Encoding -> Reset [ label = "VIDIOC_STREAMOFF(CAPTURE)" ];
> >>> + Encoding -> Stopped [ label = "VIDIOC_STREAMOFF(OUTPUT)" ];
> >>> + Encoding -> Encoding;
> >>> +
> >>> + Drain -> Stopped [ label = "All CAPTURE\nbuffers dequeued\nor\nVIDIOC_STREAMOFF(CAPTURE)" ];
> >>> + Drain -> Reset [ label = "VIDIOC_STREAMOFF(CAPTURE)" ];
> >>> +
> >>> + Reset -> Encoding [ label = "VIDIOC_STREAMON(CAPTURE)" ];
> >>> + Reset -> Initialization [ label = "VIDIOC_REQBUFS(OUTPUT, 0)" ];
> >>> +
> >>> + Stopped -> Encoding [ label = "V4L2_ENC_CMD_START\nor\nVIDIOC_STREAMON(OUTPUT)" ];
> >>> + Stopped -> Reset [ label = "VIDIOC_STREAMOFF(CAPTURE)" ];
> >>> + }
> >>> +
> >>> +Querying capabilities
> >>> +=====================
> >>> +
> >>> +1. To enumerate the set of coded formats supported by the encoder, the
> >>> + client may call :c:func:`VIDIOC_ENUM_FMT` on ``CAPTURE``.
> >>> +
> >>> + * The full set of supported formats will be returned, regardless of the
> >>> + format set on ``OUTPUT``.
> >>> +
> >>> +2. To enumerate the set of supported raw formats, the client may call
> >>> + :c:func:`VIDIOC_ENUM_FMT` on ``OUTPUT``.
> >>> +
> >>> + * Only the formats supported for the format currently active on ``CAPTURE``
> >>> + will be returned.
> >>> +
> >>> + * In order to enumerate raw formats supported by a given coded format,
> >>> + the client must first set that coded format on ``CAPTURE`` and then
> >>> + enumerate the formats on ``OUTPUT``.
> >>> +
> >>> +3. The client may use :c:func:`VIDIOC_ENUM_FRAMESIZES` to detect supported
> >>> + resolutions for a given format, passing desired pixel format in
> >>> + :c:type:`v4l2_frmsizeenum` ``pixel_format``.
> >>> +
> >>> + * Values returned by :c:func:`VIDIOC_ENUM_FRAMESIZES` for a coded pixel
> >>> + format will include all possible coded resolutions supported by the
> >>> + encoder for the given coded pixel format.
> >>> +
> >>> + * Values returned by :c:func:`VIDIOC_ENUM_FRAMESIZES` for a raw pixel format
> >>> + will include all possible frame buffer resolutions supported by the
> >>> + encoder for the given raw pixel format and the coded format currently set on
> >>> + ``CAPTURE``.
> >>> +
> >>> +4. Supported profiles and levels for the coded format currently set on
> >>> + ``CAPTURE``, if applicable, may be queried using their respective controls
> >>> + via :c:func:`VIDIOC_QUERYCTRL`.
> >>> +
> >>> +5. Any additional encoder capabilities may be discovered by querying
> >>> + their respective controls.
> >>> +
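> >>> +For illustration only, the enumeration sequence above could be sketched as
> >>> +follows. This assumes an already opened encoder file descriptor ``fd``, the
> >>> +usual headers (``<linux/videodev2.h>``, ``<sys/ioctl.h>``, ``<stdio.h>``),
> >>> +the multi-planar buffer types and ``V4L2_PIX_FMT_NV12`` as an example raw
> >>> +format; error handling is omitted.
> >>> +
> >>> +.. code-block:: c
> >>> +
> >>> +    /* 1. Enumerate coded formats supported on CAPTURE. */
> >>> +    struct v4l2_fmtdesc cap_fmt = {
> >>> +        .type = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE,
> >>> +    };
> >>> +
> >>> +    for (cap_fmt.index = 0; !ioctl(fd, VIDIOC_ENUM_FMT, &cap_fmt); cap_fmt.index++)
> >>> +        printf("coded format: %.4s\n", (char *)&cap_fmt.pixelformat);
> >>> +
> >>> +    /* 2. Enumerate raw formats supported on OUTPUT for the current coded format. */
> >>> +    struct v4l2_fmtdesc out_fmt = {
> >>> +        .type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
> >>> +    };
> >>> +
> >>> +    for (out_fmt.index = 0; !ioctl(fd, VIDIOC_ENUM_FMT, &out_fmt); out_fmt.index++)
> >>> +        printf("raw format: %.4s\n", (char *)&out_fmt.pixelformat);
> >>> +
> >>> +    /* 3. Enumerate frame sizes for an example raw format. */
> >>> +    struct v4l2_frmsizeenum frmsize = {
> >>> +        .pixel_format = V4L2_PIX_FMT_NV12,
> >>> +    };
> >>> +
> >>> +    if (!ioctl(fd, VIDIOC_ENUM_FRAMESIZES, &frmsize) &&
> >>> +        frmsize.type == V4L2_FRMSIZE_TYPE_STEPWISE)
> >>> +        printf("up to %ux%u\n", frmsize.stepwise.max_width,
> >>> +               frmsize.stepwise.max_height);
> >>> +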
> >>> +Initialization
> >>> +==============
> >>> +
> >>> +1. Set the coded format on the ``CAPTURE`` queue via :c:func:`VIDIOC_S_FMT`
> >>> +
> >>> + * **Required fields:**
> >>> +
> >>> + ``type``
> >>> + a ``V4L2_BUF_TYPE_*`` enum appropriate for ``CAPTURE``
> >>> +
> >>> + ``pixelformat``
> >>> + the coded format to be produced
> >>> +
> >>> + ``sizeimage``
> >>> + desired size of ``CAPTURE`` buffers; the encoder may adjust it to
> >>> + match hardware requirements
> >>> +
> >>> + ``width``, ``height``
> >>> + ignored (always zero)
> >>> +
> >>> + other fields
> >>> + follow standard semantics
> >>> +
> >>> + * **Return fields:**
> >>> +
> >>> + ``sizeimage``
> >>> + adjusted size of ``CAPTURE`` buffers
> >>> +
> >>> + .. important::
> >>> +
> >>> + Changing the ``CAPTURE`` format may change the currently set ``OUTPUT``
> >>> + format. The encoder will derive a new ``OUTPUT`` format from the
> >>> + ``CAPTURE`` format being set, including resolution, colorimetry
> >>> + parameters, etc. If the client needs a specific ``OUTPUT`` format, it
> >>> + must adjust it afterwards.
> >>> +
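> >>> + A rough sketch of this first step (H.264 and the 1 MiB ``sizeimage`` hint are
> >>> + example choices only and the encoder may adjust both; error handling omitted):
> >>> +
> >>> + .. code-block:: c
> >>> +
> >>> +     struct v4l2_format fmt = {
> >>> +         .type = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE,
> >>> +     };
> >>> +
> >>> +     fmt.fmt.pix_mp.pixelformat = V4L2_PIX_FMT_H264;
> >>> +     fmt.fmt.pix_mp.num_planes = 1;
> >>> +     /* Size hint for coded buffers; the encoder may adjust it. */
> >>> +     fmt.fmt.pix_mp.plane_fmt[0].sizeimage = 1024 * 1024;
> >>> +
> >>> +     ioctl(fd, VIDIOC_S_FMT, &fmt);
> >>> +     /* fmt.fmt.pix_mp.plane_fmt[0].sizeimage now holds the adjusted size. */
> >>> +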
> >>> +2. **Optional.** Enumerate supported ``OUTPUT`` formats (raw formats for
> >>> + source) for the selected coded format via :c:func:`VIDIOC_ENUM_FMT`.
> >>> +
> >>> + * **Required fields:**
> >>> +
> >>> + ``type``
> >>> + a ``V4L2_BUF_TYPE_*`` enum appropriate for ``OUTPUT``
> >>> +
> >>> + other fields
> >>> + follow standard semantics
> >>> +
> >>> + * **Return fields:**
> >>> +
> >>> + ``pixelformat``
> >>> + raw format supported for the coded format currently selected on
> >>> + the ``CAPTURE`` queue.
> >>> +
> >>> + other fields
> >>> + follow standard semantics
> >>> +
> >>> +3. Set the raw source format on the ``OUTPUT`` queue via
> >>> + :c:func:`VIDIOC_S_FMT`.
> >>> +
> >>> + * **Required fields:**
> >>> +
> >>> + ``type``
> >>> + a ``V4L2_BUF_TYPE_*`` enum appropriate for ``OUTPUT``
> >>> +
> >>> + ``pixelformat``
> >>> + raw format of the source
> >>> +
> >>> + ``width``, ``height``
> >>> + source resolution
> >>> +
> >>> + other fields
> >>> + follow standard semantics
> >>> +
> >>> + * **Return fields:**
> >>> +
> >>> + ``width``, ``height``
> >>> + may be adjusted by the encoder to match alignment requirements, as
> >>> + required by the currently selected formats
> >>> +
> >>> + other fields
> >>> + follow standard semantics
> >>> +
> >>> + * Setting the source resolution will reset the selection rectangles to their
> >>> + default values, based on the new resolution, as described in the next
> >>> + step.
> >>> +
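> >>> + One possible sketch of this step (``V4L2_PIX_FMT_NV12`` and 1920x1080 are
> >>> + example values only; error handling omitted):
> >>> +
> >>> + .. code-block:: c
> >>> +
> >>> +     struct v4l2_format fmt = {
> >>> +         .type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
> >>> +     };
> >>> +
> >>> +     fmt.fmt.pix_mp.pixelformat = V4L2_PIX_FMT_NV12;
> >>> +     fmt.fmt.pix_mp.width = 1920;
> >>> +     fmt.fmt.pix_mp.height = 1080; /* may be aligned up, e.g. to 1088 */
> >>> +
> >>> +     ioctl(fd, VIDIOC_S_FMT, &fmt);
> >>> +     /* fmt.fmt.pix_mp.width/height now hold the values adjusted by the encoder. */
> >>> +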
> >>> +4. **Optional.** Set the visible resolution for the stream metadata via
> >>> + :c:func:`VIDIOC_S_SELECTION` on the ``OUTPUT`` queue.
> >>> +
> >>> + * **Required fields:**
> >>> +
> >>> + ``type``
> >>> + a ``V4L2_BUF_TYPE_*`` enum appropriate for ``OUTPUT``
> >>> +
> >>> + ``target``
> >>> + set to ``V4L2_SEL_TGT_CROP``
> >>> +
> >>> + ``r.left``, ``r.top``, ``r.width``, ``r.height``
> >>> + visible rectangle; this must fit within the ``V4L2_SEL_TGT_CROP_BOUNDS``
> >>> + rectangle and may be subject to adjustment to match codec and
> >>> + hardware constraints
> >>> +
> >>> + * **Return fields:**
> >>> +
> >>> + ``r.left``, ``r.top``, ``r.width``, ``r.height``
> >>> + visible rectangle adjusted by the encoder
> >>> +
> >>> + * The following selection targets are supported on ``OUTPUT``:
> >>> +
> >>> + ``V4L2_SEL_TGT_CROP_BOUNDS``
> >>> + equal to the full source frame, matching the active ``OUTPUT``
> >>> + format
> >>> +
> >>> + ``V4L2_SEL_TGT_CROP_DEFAULT``
> >>> + equal to ``V4L2_SEL_TGT_CROP_BOUNDS``
> >>> +
> >>> + ``V4L2_SEL_TGT_CROP``
> >>> + rectangle within the source buffer to be encoded into the
> >>> + ``CAPTURE`` stream; defaults to ``V4L2_SEL_TGT_CROP_DEFAULT``
> >>> +
> >>> + .. note::
> >>> +
> >>> + A common use case for this selection target is encoding a source
> >>> + video with a resolution that is not a multiple of a macroblock,
> >>> + e.g. the common 1920x1080 resolution may require the source
> >>> + buffers to be aligned to 1920x1088 for codecs with 16x16 macroblock
> >>> + size. To avoid encoding the padding, the client needs to explicitly
> >>> + configure this selection target to 1920x1080.
> >>> +
> >>> + ``V4L2_SEL_TGT_COMPOSE_BOUNDS``
> >>> + maximum rectangle within the coded resolution, which the cropped
> >>> + source frame can be composed into; if the hardware does not support
> >>> + composition or scaling, then this is always equal to the rectangle of
> >>> + width and height matching ``V4L2_SEL_TGT_CROP`` and located at (0, 0)
> >>> +
> >>> + ``V4L2_SEL_TGT_COMPOSE_DEFAULT``
> >>> + equal to a rectangle of width and height matching
> >>> + ``V4L2_SEL_TGT_CROP`` and located at (0, 0)
> >>> +
> >>> + ``V4L2_SEL_TGT_COMPOSE``
> >>> + rectangle within the coded frame, which the cropped source frame
> >>> + is to be composed into; defaults to
> >>> + ``V4L2_SEL_TGT_COMPOSE_DEFAULT``; read-only on hardware without
> >>> + additional compose/scaling capabilities; the resulting stream will
> >>> + have this rectangle encoded as the visible rectangle in its
> >>> + metadata
> >>
> >> I would only support the COMPOSE targets if the hardware can actually do
> >> scaling and/or composing. That is conform standard V4L2 behavior where
> >> cropping/composing is only implemented if the hardware can actually do
> >> this.
> >>
> >
> > Please see my other reply to your earlier similar comment in this thread.
> >
> >>> +
> >>> + .. warning::
> >>> +
> >>> + The encoder may adjust the crop/compose rectangles to the nearest
> >>> + supported ones to meet codec and hardware requirements. The client needs
> >>> + to check the adjusted rectangle returned by :c:func:`VIDIOC_S_SELECTION`.
> >>> +
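> >>> + For the 1920x1080 case mentioned in the note above, a minimal sketch of this
> >>> + step (error handling omitted) could be:
> >>> +
> >>> + .. code-block:: c
> >>> +
> >>> +     struct v4l2_selection sel = {
> >>> +         .type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
> >>> +         .target = V4L2_SEL_TGT_CROP,
> >>> +         .r = { .left = 0, .top = 0, .width = 1920, .height = 1080 },
> >>> +     };
> >>> +
> >>> +     ioctl(fd, VIDIOC_S_SELECTION, &sel);
> >>> +     /* sel.r now contains the rectangle adjusted by the encoder. */
> >>> +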
> >>> +5. Allocate buffers for both ``OUTPUT`` and ``CAPTURE`` via
> >>> + :c:func:`VIDIOC_REQBUFS`. This may be performed in any order.
> >>> +
> >>> + * **Required fields:**
> >>> +
> >>> + ``count``
> >>> + requested number of buffers to allocate; greater than zero
> >>> +
> >>> + ``type``
> >>> + a ``V4L2_BUF_TYPE_*`` enum appropriate for ``OUTPUT`` or
> >>> + ``CAPTURE``
> >>> +
> >>> + other fields
> >>> + follow standard semantics
> >>> +
> >>> + * **Return fields:**
> >>> +
> >>> + ``count``
> >>> + actual number of buffers allocated
> >>> +
> >>> + .. warning::
> >>> +
> >>> + The actual number of allocated buffers may differ from the ``count``
> >>> + given. The client must check the updated value of ``count`` after the
> >>> + call returns.
> >>> +
> >>> + .. note::
> >>> +
> >>> + To allocate more than the minimum number of ``OUTPUT`` buffers (for pipeline
> >>> + depth), the client may query the ``V4L2_CID_MIN_BUFFERS_FOR_OUTPUT``
> >>> + control to get the minimum number of buffers required, and pass the
> >>> + obtained value plus the number of additional buffers needed in the
> >>> + ``count`` field to :c:func:`VIDIOC_REQBUFS`.
> >>> +
> >>> + Alternatively, :c:func:`VIDIOC_CREATE_BUFS` can be used to have more
> >>> + control over buffer allocation.
> >>> +
> >>> + * **Required fields:**
> >>> +
> >>> + ``count``
> >>> + requested number of buffers to allocate; greater than zero
> >>> +
> >>> + ``type``
> >>> + a ``V4L2_BUF_TYPE_*`` enum appropriate for ``OUTPUT``
> >>> +
> >>> + other fields
> >>> + follow standard semantics
> >>> +
> >>> + * **Return fields:**
> >>> +
> >>> + ``count``
> >>> + adjusted to the number of allocated buffers
> >>> +
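> >>> + To illustrate the ``OUTPUT`` allocation described in the note above, a sketch
> >>> + could look as below (``V4L2_MEMORY_MMAP`` and 2 extra buffers are arbitrary
> >>> + example choices; error handling omitted):
> >>> +
> >>> + .. code-block:: c
> >>> +
> >>> +     struct v4l2_control ctrl = {
> >>> +         .id = V4L2_CID_MIN_BUFFERS_FOR_OUTPUT,
> >>> +     };
> >>> +     struct v4l2_requestbuffers reqbufs = {
> >>> +         .type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
> >>> +         .memory = V4L2_MEMORY_MMAP,
> >>> +     };
> >>> +
> >>> +     /* Fall back to 1 if the control is not implemented by the driver. */
> >>> +     if (ioctl(fd, VIDIOC_G_CTRL, &ctrl))
> >>> +         ctrl.value = 1;
> >>> +
> >>> +     /* Request extra buffers on top of the minimum, for pipeline depth. */
> >>> +     reqbufs.count = ctrl.value + 2;
> >>> +     ioctl(fd, VIDIOC_REQBUFS, &reqbufs);
> >>> +     /* reqbufs.count now holds the number of buffers actually allocated. */
> >>> +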
> >>> +6. Begin streaming on both ``OUTPUT`` and ``CAPTURE`` queues via
> >>> + :c:func:`VIDIOC_STREAMON`. This may be performed in any order. The actual
> >>> + encoding process starts when both queues start streaming.
> >>> +
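> >>> +By way of example, a minimal sketch of this step, assuming the multi-planar
> >>> +buffer types and omitting error handling:
> >>> +
> >>> +.. code-block:: c
> >>> +
> >>> +    int out_type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
> >>> +    int cap_type = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
> >>> +
> >>> +    /* Order does not matter; encoding starts once both queues are streaming. */
> >>> +    ioctl(fd, VIDIOC_STREAMON, &out_type);
> >>> +    ioctl(fd, VIDIOC_STREAMON, &cap_type);
> >>> +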
> >>> +.. note::
> >>> +
> >>> + If the client stops the ``CAPTURE`` queue during the encode process and then
> >>> + restarts it again, the encoder will begin generating a stream independent
> >>> + from the stream generated before the stop. The exact constraints depend
> >>> + on the coded format, but may include the following implications:
> >>> +
> >>> + * encoded frames produced after the restart must not reference any
> >>> + frames produced before the stop, e.g. no long term references for
> >>> + H.264,
> >>> +
> >>> + * any headers that must be included in a standalone stream must be
> >>> + produced again, e.g. SPS and PPS for H.264.
> >>> +
> >>> +Encoding
> >>> +========
> >>> +
> >>> +This state is reached after the `Initialization` sequence finishes
> >>> +successfully. In this state, the client queues and dequeues buffers to both
> >>> +queues via :c:func:`VIDIOC_QBUF` and :c:func:`VIDIOC_DQBUF`, following the
> >>> +standard semantics.
> >>> +
> >>> +The contents of encoded ``CAPTURE`` buffers depend on the active coded pixel
> >>> +format and may be affected by codec-specific extended controls, as stated
> >>> +in the documentation of each format.
> >>> +
> >>> +Both queues operate independently, following standard behavior of V4L2 buffer
> >>> +queues and memory-to-memory devices. In addition, the order of encoded frames
> >>> +dequeued from the ``CAPTURE`` queue may differ from the order of queuing raw
> >>> +frames to the ``OUTPUT`` queue, due to properties of the selected coded format,
> >>> +e.g. frame reordering.
> >>> +
> >>> +The client must not assume any direct relationship between ``CAPTURE`` and
> >>> +``OUTPUT`` buffers, nor any specific timing of buffers becoming
> >>> +available to dequeue. Specifically:
> >>> +
> >>> +* a buffer queued to ``OUTPUT`` may result in more than 1 buffer produced on
> >>> + ``CAPTURE`` (if returning an encoded frame allowed the encoder to return a
> >>> + frame that preceded it in display, but succeeded it in the decode order),
> >>> +
> >>> +* a buffer queued to ``OUTPUT`` may result in a buffer being produced on
> >>> + ``CAPTURE`` later into the encoding process, and/or after processing further
> >>> + ``OUTPUT`` buffers, or be returned out of order, e.g. if display
> >>> + reordering is used,
> >>> +
> >>> +* buffers may become available on the ``CAPTURE`` queue without additional
> >>> + buffers queued to ``OUTPUT`` (e.g. during drain or ``EOS``), because of the
> >>> + ``OUTPUT`` buffers queued in the past whose encoding results are only
> >>> + available at a later time, due to specifics of the encoding process,
> >>> +
> >>> +* buffers queued to ``OUTPUT`` may not become available to dequeue instantly
> >>> + after being encoded into a corresponding ``CAPTURE`` buffer, e.g. if the
> >>> + encoder needs to use the frame as a reference for encoding further frames.
> >>> +
> >>> +.. note::
> >>> +
> >>> + To allow matching encoded ``CAPTURE`` buffers with ``OUTPUT`` buffers they
> >>> + originated from, the client can set the ``timestamp`` field of the
> >>> + :c:type:`v4l2_buffer` struct when queuing an ``OUTPUT`` buffer. The
> >>> + ``CAPTURE`` buffer(s) which resulted from encoding that ``OUTPUT`` buffer
> >>> + will have their ``timestamp`` field set to the same value when dequeued.
> >>> +
> >>> + In addition to the straightforward case of one ``OUTPUT`` buffer producing
> >>> + one ``CAPTURE`` buffer, the following cases are defined:
> >>> +
> >>> + * one ``OUTPUT`` buffer generates multiple ``CAPTURE`` buffers: the same
> >>> + ``OUTPUT`` timestamp will be copied to multiple ``CAPTURE`` buffers,
> >>> +
> >>> + * the encoding order differs from the presentation order (i.e. the
> >>> + ``CAPTURE`` buffers are out-of-order compared to the ``OUTPUT`` buffers):
> >>> + ``CAPTURE`` timestamps will not retain the order of ``OUTPUT`` timestamps
> >>> + and thus monotonicity of the timestamps cannot be guaranteed.
> >>> +
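> >>> +A rough sketch of this timestamp matching, assuming buffers have already been
> >>> +allocated, mmapped and filled as appropriate, using the multi-planar API and
> >>> +omitting error handling (the value 42 is just an arbitrary client-chosen tag):
> >>> +
> >>> +.. code-block:: c
> >>> +
> >>> +    /* Tag a raw frame when queuing it on OUTPUT... */
> >>> +    struct v4l2_plane out_planes[1] = {{ 0 }};
> >>> +    struct v4l2_buffer out_buf = {
> >>> +        .type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
> >>> +        .memory = V4L2_MEMORY_MMAP,
> >>> +        .index = 0,
> >>> +        .length = 1,
> >>> +        .m.planes = out_planes,
> >>> +    };
> >>> +
> >>> +    /* out_planes[0].bytesused etc. are assumed to be set up beforehand. */
> >>> +    out_buf.timestamp.tv_usec = 42;
> >>> +    ioctl(fd, VIDIOC_QBUF, &out_buf);
> >>> +
> >>> +    /* ... and find it again on the encoded CAPTURE buffer when it is dequeued. */
> >>> +    struct v4l2_plane cap_planes[1] = {{ 0 }};
> >>> +    struct v4l2_buffer cap_buf = {
> >>> +        .type = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE,
> >>> +        .memory = V4L2_MEMORY_MMAP,
> >>> +        .length = 1,
> >>> +        .m.planes = cap_planes,
> >>> +    };
> >>> +
> >>> +    if (!ioctl(fd, VIDIOC_DQBUF, &cap_buf))
> >>> +        printf("encoded data for the source frame tagged %ld\n",
> >>> +               (long)cap_buf.timestamp.tv_usec);
> >>> +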
> >>> +.. note::
> >>> +
> >>> + To let the client distinguish between frame types (keyframes, intermediate
> >>> + frames; the exact list of types depends on the coded format), the
> >>> + ``CAPTURE`` buffers will have corresponding flag bits set in their
> >>> + :c:type:`v4l2_buffer` struct when dequeued. See the documentation of
> >>> + :c:type:`v4l2_buffer` and each coded pixel format for exact list of flags
> >>> + and their meanings.
> >>
> >> I don't think we can require this since a capture buffer may contain multiple
> >> encoded frames.
> >>
> >
> > I thought we required that only one encoded frame was in one CAPTURE
> > buffer. Real time use cases rely heavily on this frame type
> > information, so I can't imagine not requiring this.
>
> That the CAPTURE buffer contains only one encoded frame is never stated
> explicitly. I am not so sure I want that to be a hard requirement anyway
> since the old ivtv MPEG encoder just produces a bitstream.
>
> Perhaps this should be signaled with a flag in ENUM_FMT?
>
> >
> >> It would actually make more sense to return it in the output buffer, but I don't
> >> know if a hardware encoder can actually provide that information.
> >>
> >
> > I believe all the already existing drivers provide the information
> > about the encoded frame type, but I don't think they provide the
> > information about what source frame it came from.
> >
> >> Another use of these flags for an output buffer is to force a keyframe if for
> >> example a scene change was detected.
> >>
> >> My feeling is that we should drop this note. Forcing a keyframe by setting that
> >> flag for the output buffer might actually be a useful thing to do for a stateful
> >> encoder.
> >>
> >
> > However, to force keyframe, one sets it in the OUTPUT buffer. Then, to
> > actually get the right CAPTURE buffer, one has to look for one with
> > this flag set.
>
> So *if* the driver stores only one encoded frame in a CAPTURE buffer, then we
> can require that these flags have to be set for that CAPTURE buffer. Otherwise
> they should be cleared since they cannot be associated with a specific buffer.

But then we don't know which source frame it applies to, while it's
usually quite important to force the keyframe at the right frame,
e.g. on a scene change.

>
> And I think it should be documented that you can set the KEYFRAME flag in the
> OUTPUT buffer to force a keyframe (the driver may ignore this if it can't do
> this for some reason).

Indeed. Let me make sure it's included in the document.

Best regards,
Tomasz