Re: [PATCH v4 2/2] media: cedrus: Add H264 decoding support

From: Jernej Škrabec
Date: Tue Mar 05 2019 - 12:05:19 EST


On Tuesday, 5 March 2019 at 11:17:32 CET, Maxime Ripard wrote:
> Hi Jernej,
>
> On Wed, Feb 20, 2019 at 06:50:54PM +0100, Jernej Škrabec wrote:
> > I really wanted to do another review on previous series but got distracted
> > by analyzing one particularly troublesome H264 sample. It still doesn't
> > work correctly, so I would ask you if you can test it with your stack (it
> > might be userspace issue):
> >
> > http://jernej.libreelec.tv/videos/problematic/test.mkv
> >
> > Please take a look at my comments below.
>
> I'd really prefer to focus on getting this merged at this point, and
> then fixing odd videos and / or setups we can find later
> on. Especially when new stacks are going to be developed on top of
> this, I'm sure we're going to have plenty of bugs to address :)

I forgot to mention, you can add:
Reviewed-by: Jernej Skrabec <jernej.skrabec@xxxxxxxx>

once you fix the issues. Please take a look below for comments.

>
> > On Wednesday, 20 February 2019 at 15:17:34 CET, Maxime Ripard wrote:
> > > Introduce some basic H264 decoding support in cedrus. So far, only the
> > > baseline profile videos have been tested, and some more advanced
> > > features used in higher profiles are not even implemented.
> >
> > What is not yet implemented? Multi slice frame decoding, interlaced frames
> > and decoding frames with width > 2048. Anything else?
>
> Off the top of my head, nope.
>
> > > +static void cedrus_h264_write_sram(struct cedrus_dev *dev,
> > > + enum cedrus_h264_sram_off off,
> > > + const void *data, size_t len)
> > > +{
> > > + const u32 *buffer = data;
> > > + size_t count = DIV_ROUND_UP(len, 4);
> > > +
> > > + cedrus_write(dev, VE_AVC_SRAM_PORT_OFFSET, off << 2);
> > > +
> > > + do {
> > > + cedrus_write(dev, VE_AVC_SRAM_PORT_DATA, *buffer++);
> > > + } while (--count);
> >
> > The loop above will still write one word for count = 0. I propose the
> > following:
> >
> >     while (count--)
> >         cedrus_write(dev, VE_AVC_SRAM_PORT_DATA, *buffer++);
>
> Good catch, thanks!
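
To make the difference between the two loop shapes concrete, here is a minimal standalone sketch in plain C (not driver code). The demo_* names and the write cap are made up for this example, and demo_write() only stands in for the real register write. Note that because count is unsigned, the do/while form does not merely write once for count = 0: the decrement wraps around, which is why the sketch caps the number of writes.

```c
#include <stddef.h>
#include <stdint.h>

/* Safety cap so the wrapped-around do/while case terminates in this demo. */
#define DEMO_WRITE_CAP 16

static size_t demo_writes;	/* number of mock register writes so far */

/* Stand-in for cedrus_write(dev, VE_AVC_SRAM_PORT_DATA, val). */
static void demo_write(uint32_t val)
{
	(void)val;
	demo_writes++;
}

/* Loop shape from the patch: the body runs once before count is tested. */
static size_t demo_do_while(const uint32_t *buffer, size_t count)
{
	demo_writes = 0;
	do {
		demo_write(*buffer++);
	} while (--count && demo_writes < DEMO_WRITE_CAP);
	return demo_writes;
}

/* Proposed loop shape: count is tested before the first write. */
static size_t demo_while(const uint32_t *buffer, size_t count)
{
	demo_writes = 0;
	while (count--)
		demo_write(*buffer++);
	return demo_writes;
}
```

For a non-zero count both helpers perform the same number of writes; they only differ for count = 0.
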
>
> > > + position = find_next_zero_bit(&used_dpbs, CEDRUS_H264_FRAME_NUM,
> > > + output);
> > > + if (position >= CEDRUS_H264_FRAME_NUM)
> > > + position = find_first_zero_bit(&used_dpbs,
> > > + CEDRUS_H264_FRAME_NUM);
> >
> > I guess you didn't try any interlaced videos? Sometimes a buffer is both
> > a reference and the output at the same time. In such cases, the code above
> > would make two entries, which doesn't work based on Kwiboo's and my
> > experiments.
> >
> > I guess decoding interlaced videos is out of scope at this time?
>
> Yep, and that should be pretty easy to fix.
>
> > > +
> > > + output_buf = vb2_to_cedrus_buffer(&run->dst->vb2_buf);
> > > + output_buf->codec.h264.position = position;
> > > +
> > > + if (slice->flags & V4L2_H264_SLICE_FLAG_FIELD_PIC)
> > > + output_buf->codec.h264.pic_type = CEDRUS_H264_PIC_TYPE_FIELD;
> > > + else if (sps->flags & V4L2_H264_SPS_FLAG_MB_ADAPTIVE_FRAME_FIELD)
> > > + output_buf->codec.h264.pic_type = CEDRUS_H264_PIC_TYPE_MBAFF;
> > > + else
> > > + output_buf->codec.h264.pic_type = CEDRUS_H264_PIC_TYPE_FRAME;
> > > +
> > > + cedrus_fill_ref_pic(ctx, output_buf,
> > > + dec_param->top_field_order_cnt,
> > > + dec_param->bottom_field_order_cnt,
> > > + &pic_list[position]);
> > > +
> > > + cedrus_h264_write_sram(dev, CEDRUS_SRAM_H264_FRAMEBUFFER_LIST,
> > > + pic_list, sizeof(pic_list));
> > > +
> > > + cedrus_write(dev, VE_H264_OUTPUT_FRAME_IDX, position);
> > > +}
> > > +
> > > +#define CEDRUS_MAX_REF_IDX 32
> > > +
> > > +static void _cedrus_write_ref_list(struct cedrus_ctx *ctx,
> > > + struct cedrus_run *run,
> > > + const u8 *ref_list, u8 num_ref,
> > > + enum cedrus_h264_sram_off sram)
> > > +{
> > > + const struct v4l2_ctrl_h264_decode_param *decode = run->h264.decode_param;
> > > + struct vb2_queue *cap_q = &ctx->fh.m2m_ctx->cap_q_ctx.q;
> > > + const struct vb2_buffer *dst_buf = &run->dst->vb2_buf;
> > > + struct cedrus_dev *dev = ctx->dev;
> > > + u8 sram_array[CEDRUS_MAX_REF_IDX];
> > > + unsigned int i;
> > > + size_t size;
> > > +
> > > + memset(sram_array, 0, sizeof(sram_array));
> > > +
> > > + for (i = 0; i < num_ref; i++) {
> > > + const struct v4l2_h264_dpb_entry *dpb;
> > > + const struct cedrus_buffer *cedrus_buf;
> > > + const struct vb2_v4l2_buffer *ref_buf;
> > > + unsigned int position;
> > > + int buf_idx;
> > > + u8 dpb_idx;
> > > +
> > > + dpb_idx = ref_list[i];
> > > + dpb = &decode->dpb[dpb_idx];
> > > +
> > > + if (!(dpb->flags & V4L2_H264_DPB_ENTRY_FLAG_ACTIVE))
> > > + continue;
> > > +
> > > + buf_idx = vb2_find_timestamp(cap_q, dpb->timestamp, 0);
> > > + if (buf_idx < 0)
> > > + continue;
> > > +
> > > + ref_buf = to_vb2_v4l2_buffer(ctx->dst_bufs[buf_idx]);
> > > + cedrus_buf = vb2_v4l2_to_cedrus_buffer(ref_buf);
> > > + position = cedrus_buf->codec.h264.position;
> > > +
> > > + sram_array[i] |= position << 1;
> > > + if (ref_buf->field == V4L2_FIELD_BOTTOM)
> >
> > I'm still not convinced that checking the buffer field is an appropriate
> > solution here. IMO this bit defines a top or bottom reference, and the
> > same buffer could be used for both.
> >
> > But I guess this belongs in a follow-up patch which will fix decoding of
> > interlaced videos.
>
> And we can always change the API later on if we find that it's not adequate.
>
> > > +static void cedrus_write_scaling_lists(struct cedrus_ctx *ctx,
> > > + struct cedrus_run *run)
> > > +{
> > > + const struct v4l2_ctrl_h264_scaling_matrix *scaling =
> > > + run->h264.scaling_matrix;
> > > + struct cedrus_dev *dev = ctx->dev;
> > > +
> > > + if (!scaling)
> > > + return;
> > > +
> > > + cedrus_h264_write_sram(dev, CEDRUS_SRAM_H264_SCALING_LIST_8x8_0,
> > > + scaling->scaling_list_8x8[0],
> > > + sizeof(scaling->scaling_list_8x8[0]));
> > > +
> > > + cedrus_h264_write_sram(dev, CEDRUS_SRAM_H264_SCALING_LIST_8x8_1,
> > > + scaling->scaling_list_8x8[1],
> > > + sizeof(scaling->scaling_list_8x8[1]));
> >
> > The index above should be 3. IIRC 1 and 3 are used by 4:2:0 chroma
> > subsampling, but currently I'm unable to find a reference to that in the
> > standard.
>
> Yep, indeed, I'll fix that, thanks!

As I said in my previous e-mail, I made a mistake; it should be 0 and 3.

>
> > > +
> > > + cedrus_h264_write_sram(dev, CEDRUS_SRAM_H264_SCALING_LIST_4x4,
> > > + scaling->scaling_list_4x4,
> > > + sizeof(scaling->scaling_list_4x4));
> > > +}
> > > +
> > > +static void cedrus_write_pred_weight_table(struct cedrus_ctx *ctx,
> > > + struct cedrus_run *run)
> > > +{
> > > + const struct v4l2_ctrl_h264_slice_param *slice =
> > > + run->h264.slice_param;
> > > + const struct v4l2_h264_pred_weight_table *pred_weight =
> > > + &slice->pred_weight_table;
> > > + struct cedrus_dev *dev = ctx->dev;
> > > + int i, j, k;
> > > +
> > > + cedrus_write(dev, VE_H264_SHS_WP,
> > > + ((pred_weight->chroma_log2_weight_denom & 0xf) << 4) |
> > > + ((pred_weight->luma_log2_weight_denom & 0xf) << 0));
> >
> > Denominators are only in the range 0-7, so the mask should be 0x7. CedarX
> > code also specifies those two fields as 3 bits wide.
>
> Indeed, I'll fix it.
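
For illustration, the packing with 3-bit masks could look roughly like this. It is a standalone sketch, not the driver code: demo_pack_wp_denoms() is a made-up helper, and the field positions (chroma at bit 4, luma at bit 0) are simply taken from the quoted patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pack both log2 weight denominators into one word.
 * Each denominator is in the range 0..7, so a 3-bit mask (0x7) is enough.
 */
static uint32_t demo_pack_wp_denoms(uint8_t luma_denom, uint8_t chroma_denom)
{
	return ((uint32_t)(chroma_denom & 0x7) << 4) |
	       ((uint32_t)(luma_denom & 0x7) << 0);
}
```

With the narrower mask, an out-of-range input can no longer set bit 3 or bit 7 of the resulting word.
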
>
> > > +
> > > + cedrus_write(dev, VE_AVC_SRAM_PORT_OFFSET,
> > > + CEDRUS_SRAM_H264_PRED_WEIGHT_TABLE << 2);
> > > +
> > > + for (i = 0; i < ARRAY_SIZE(pred_weight->weight_factors); i++) {
> > > + const struct v4l2_h264_weight_factors *factors =
> > > + &pred_weight->weight_factors[i];
> > > +
> > > + for (j = 0; j < ARRAY_SIZE(factors->luma_weight); j++) {
> > > + u32 val;
> > > +
> > > + val = ((factors->luma_offset[j] & 0x1ff) << 16)
> > > + (factors->luma_weight[j] & 0x1ff);
> > > + cedrus_write(dev, VE_AVC_SRAM_PORT_DATA, val);
> >
> > You should cast the offset variable to a wider type. Currently some videos
> > which use the prediction weight table don't work for me, unless the offset
> > is cast to u32 first. Shifting 8 bit variable for 16 places gives you 0
> > every time.
>
> I'll do it.
>
> > Luma offset and weight are defined as s8, so having a wider mask doesn't
> > really make sense. However, I think weight should be s16 anyway, because
> > the standard says that its value could be 2^denominator for the default
> > value or in the range -128..127. The worst case would be 2^7 = 128 and
> > -128. To cover both values you need at least 9 bits.
>
> But if I understood the spec right, in that case you would just have
> the denominator set, and not the offset, while the offset is used if
> you don't use the default formula (and therefore remains in the
> -128..127 range which is covered by the s8), right?

Yeah, the default offset is 0 and s8 is sufficient for that. I'm talking about
the weight. The default weight is "1 << denominator", which can be as large as
1 << 7 = 128.

We could also add a flag which would signal the default table. In that case we
could just set a bit to tell the VPU to use default values. Even if some VPUs
need the default table to be set explicitly, it's very easy to calculate the
values as mentioned in the previous paragraph.
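
To illustrate the range argument, here is a minimal standalone sketch in plain C (not driver code). Both helper names are made up for this example, and it assumes the denominator is already limited to 0..7.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default weight when none is signalled explicitly: 1 << denominator. */
static int16_t demo_default_weight(uint8_t log2_weight_denom)
{
	return (int16_t)(1 << (log2_weight_denom & 0x7));
}

/* True if value can be stored in a signed 8-bit field (-128..127). */
static bool demo_fits_in_s8(int value)
{
	return value >= INT8_MIN && value <= INT8_MAX;
}
```

For a denominator of 7 the default weight is 128, which is outside the s8 range, while an explicit weight of -128 is inside it. Covering both ends needs a signed field of at least 9 bits.
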

Best regards,
Jernej

>
> > > + reg = 0;
> > > + if (!(scaling && (pps->flags & V4L2_H264_PPS_FLAG_PIC_SCALING_MATRIX_PRESENT)))
> > > + reg |= VE_H264_SHS_QP_SCALING_MATRIX_DEFAULT;
> > > + reg |= (pps->second_chroma_qp_index_offset & 0x3f) << 16;
> > > + reg |= (pps->chroma_qp_index_offset & 0x3f) << 8;
> > > + reg |= (pps->pic_init_qp_minus26 + 26 + slice->slice_qp_delta) & 0x3f;
> > > + cedrus_write(dev, VE_H264_SHS_QP, reg);
> > > +
> > > + // clear status flags
> > > + cedrus_write(dev, VE_H264_STATUS, cedrus_read(dev, VE_H264_STATUS));
> >
> > I'm not sure clearing the status here is needed. Do you have any case
> > where it is needed? Maybe if some error happened before and
> > cedrus_h264_irq_clear() wasn't called. I'm fine either way.
>
> Yeah, it's just some extra precaution.
>
> > > +
> > > + // enable int
> > > + reg = cedrus_read(dev, VE_H264_CTRL);
> > > + cedrus_write(dev, VE_H264_CTRL, reg |
> > > + VE_H264_CTRL_SLICE_DECODE_INT |
> > > + VE_H264_CTRL_DECODE_ERR_INT |
> > > + VE_H264_CTRL_VLD_DATA_REQ_INT);
> >
> > Since this is the only place where you set VE_H264_CTRL, I wouldn't
> > preserve the previous content. This mode is also capable of decoding VP8
> > and AVS. So in theory, if a user wanted to decode H264 and VP8 videos at
> > the same time, preserving the content would probably corrupt your output.
> > I would just set all other bits to 0. What do you think? I tested this
> > without preservation and it works fine.
>
> I'll change it.
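
A small standalone sketch of the difference between the two approaches follows. The bit positions below are hypothetical and chosen only for illustration; they are not the real VE_H264_CTRL layout, and DEMO_CTRL_STALE_BIT just stands in for whatever another codec might have left behind in the register.

```c
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define DEMO_CTRL_SLICE_DECODE_INT	(1u << 0)
#define DEMO_CTRL_DECODE_ERR_INT	(1u << 1)
#define DEMO_CTRL_VLD_DATA_REQ_INT	(1u << 2)
#define DEMO_CTRL_STALE_BIT		(1u << 24)

#define DEMO_CTRL_INT_MASK	(DEMO_CTRL_SLICE_DECODE_INT | \
				 DEMO_CTRL_DECODE_ERR_INT | \
				 DEMO_CTRL_VLD_DATA_REQ_INT)

/* Read-modify-write, as in the patch: stale bits in old survive. */
static uint32_t demo_ctrl_preserve(uint32_t old)
{
	return old | DEMO_CTRL_INT_MASK;
}

/* Plain write: only the interrupt enables are set, everything else is 0. */
static uint32_t demo_ctrl_overwrite(void)
{
	return DEMO_CTRL_INT_MASK;
}
```
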
>
> > > + /*
> > > + * FIXME: This is actually conditional to
> > > + * V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY not being set, we might
> > > + * have to rework this if memory efficiency ever is something
> > > + * we need to work on.
> > > + */
> > > + field_size = field_size * 2;
> > > + ctx->codec.h264.mv_col_buf_field_size = field_size;
> >
> > CedarX code aligns this buffer to 1024. Should we do it too, just to be on
> > the safe side? I don't think it costs us anything due to
> > dma_alloc_coherent() alignments.
>
> dma_alloc_coherent will operate on pages, so it doesn't make any
> difference there.
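
A quick way to see why the extra alignment costs nothing is the standalone sketch below. It assumes 4 KiB pages purely for the sake of the example, and demo_round_up_pow2() is a made-up helper, not a kernel API.

```c
#include <stddef.h>

/* Round x up to the next multiple of align; align must be a power of two. */
static size_t demo_round_up_pow2(size_t x, size_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Size after page rounding only (assumed page size: 4096 bytes). */
static size_t demo_page_only(size_t size)
{
	return demo_round_up_pow2(size, 4096);
}

/* Size after aligning to 1024 first and then rounding to a page. */
static size_t demo_align_then_page(size_t size)
{
	return demo_round_up_pow2(demo_round_up_pow2(size, 1024), 4096);
}
```

Since 1024 divides the assumed page size, both helpers return the same value for any size, so the 1024-byte alignment would not change how much memory ends up being allocated.
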
>
> > Sorry again for the somewhat late in-depth review.
>
> Thanks a lot!
> Maxime