Re: [RFC PATCH v1] media: uvcvideo: Cache URB header data before processing

From: Keiichi Watanabe
Date: Wed Aug 08 2018 - 10:02:39 EST


Hi Kieran,
On Wed, Aug 8, 2018 at 10:07 PM Kieran Bingham
<kieran.bingham@xxxxxxxxxxxxxxxx> wrote:
>
> Hi All,
>
> On 08/08/18 13:45, Keiichi Watanabe wrote:
> > Hi Laurent, Kieran, Tomasz,
> >
> > Thank you for reviews and suggestions.
> > I want to do additional measurements for improving the performance.
> >
> > Let me clarify my understanding:
> > Currently, if the platform doesn't support coherent-DMA (e.g. ARM),
> > urb_buffer is allocated by usb_alloc_coherent with
> > URB_NO_TRANSFER_DMA_MAP flag instead of using kmalloc.
> > This is because we want to avoid frequent DMA mappings, which are
> > generally expensive.
> > However, memories allocated in this way are not cached.
> >
> > So, we wonder if using usb_alloc_coherent is really fast.
> > In other words, we want to know which is better:
> > "No DMA mapping/Uncached memory" v.s. "Frequent DMA mapping/Cached memory".
> >
> > For this purpose, I'm planning to measure performance on ARM
> > Chromebooks in the following conditions:
> > 1. Current implementation with Kieran's patches
> > 2. 1. + my patch
> > 3. Use kmalloc instead
> >
> > 1 and 2 are the same conditions I reported in the first mail on this thread.
> > For condition 3, I only have to add "#define CONFIG_DMA_NONCOHERENT"
> > at the beginning of uvc_video.c.
>
> I'd be interested in numbers/performances both with and without my async
> if possible too.
>
> The async path can be easily disabled temporarily with the following
> change: (perhaps this should be a module option?)
>
> diff --git a/drivers/media/usb/uvc/uvc_video.c
> b/drivers/media/usb/uvc/uvc_video.c
> index 8bb6e90f3483..f9fbdc9bfa4b 100644
> --- a/drivers/media/usb/uvc/uvc_video.c
> +++ b/drivers/media/usb/uvc/uvc_video.c
> @@ -1505,7 +1505,8 @@ static void uvc_video_complete(struct urb *urb)
> }
>
> INIT_WORK(&uvc_urb->work, uvc_video_copy_data_work);
> - queue_work(stream->async_wq, &uvc_urb->work);
> +// queue_work(stream->async_wq, &uvc_urb->work);
> + uvc_video_copy_data_work(&uvc_urb->work);
> }
>
> /*
>
>
> I do suspect that even with cached buffers, it's probably likely we
> should still consider the async patches to move the memcopy out of
> interrupt context.
>
It sounds better to check.
I'll do tests with/without your async patches in both cached/uncached
memory allocation conditions.

Best regards,
Keiichi

> --
> Regards
>
> Kieran
>
>
>
>
> >
> > Does this plan sound reasonable?
> >
> > Best regards,
> > Keiichi
> > On Wed, Aug 8, 2018 at 5:42 PM Laurent Pinchart
> > <laurent.pinchart@xxxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi Tomasz,
> >>
> >> On Wednesday, 8 August 2018 07:08:59 EEST Tomasz Figa wrote:
> >>> On Tue, Jul 31, 2018 at 1:00 AM Laurent Pinchart wrote:
> >>>> On Wednesday, 27 June 2018 13:34:08 EEST Keiichi Watanabe wrote:
> >>>>> On some platforms with non-coherent DMA (e.g. ARM), USB drivers use
> >>>>> uncached memory allocation methods. In such situations, it sometimes
> >>>>> takes a long time to access URB buffers. This can be a cause of video
> >>>>> flickering problems if a resolution is high and a USB controller has
> >>>>> a very tight time limit. (e.g. dwc2) To avoid this problem, we copy
> >>>>> header data from (uncached) URB buffer into (cached) local buffer.
> >>>>>
> >>>>> This change should make the elapsed time of the interrupt handler
> >>>>> shorter on platforms with non-coherent DMA. We measured the elapsed
> >>>>> time of each callback of uvc_video_complete without/with this patch
> >>>>> while capturing Full HD video in
> >>>>> https://webrtc.github.io/samples/src/content/getusermedia/resolution/.
> >>>>> I tested it on the top of Kieran Bingham's Asynchronous UVC series
> >>>>> https://www.mail-archive.com/linux-media@xxxxxxxxxxxxxxx/msg128359.html.
> >>>>> The test device was Jerry Chromebook (RK3288) with Logitech Brio 4K.
> >>>>> I collected data for 5 seconds. (There were around 480 callbacks in
> >>>>> this case.) The following result shows that this patch makes
> >>>>> uvc_video_complete about 2x faster.
> >>>>>
> >>>>> | average | median | min | max | standard deviation
> >>>>> w/o caching| 45319ns | 40250ns | 33834ns | 142625ns| 16611ns
> >>>>> w/ caching| 20620ns | 19250ns | 12250ns | 56583ns | 6285ns
> >>>>>
> >>>>> In addition, we confirmed that this patch doesn't make it worse on
> >>>>> coherent DMA architecture by performing the same measurements on a
> >>>>> Broadwell Chromebox with the same camera.
> >>>>>
> >>>>> | average | median | min | max | standard deviation
> >>>>> w/o caching| 21026ns | 21424ns | 12263ns | 23956ns | 1932ns
> >>>>> w/ caching| 20728ns | 20398ns | 8922ns | 45120ns | 3368ns
> >>>>
> >>>> This is very interesting, and it seems related to https://
> >>>> patchwork.kernel.org/patch/10468937/. You might have seen that discussion
> >>>> as you got CC'ed at some point.
> >>>>
> >>>> I wonder whether performances couldn't be further improved by allocating
> >>>> the URB buffers cached, as that would speed up the memcpy() as well. Have
> >>>> you tested that by any chance ?
> >>>
> >>> We haven't measure it, but the issue being solved here was indeed
> >>> significantly reduced by using cached URB buffers, even without
> >>> Kieran's async series. After we discovered the latter, we just
> >>> backported it and decided to further tweak the last remaining bit, to
> >>> avoid playing too much with the DMA API in code used in production on
> >>> several different platforms (including both ARM and x86).
> >>>
> >>> If you think we could change the driver to use cached buffers instead
> >>> (as the pwc driver mentioned in another thread), I wouldn't have
> >>> anything against it obviously.
> >>
> >> I think there's a chance that performances could be further improved.
> >> Furthermore, it would lean to simpler code as we wouldn't need to deal with
> >> caching headers manually. I would however like to see numbers before making a
> >> decision.
> >>
> >> --
> >> Regards,
> >>
> >> Laurent Pinchart
> >>
> >>
> >>
>
> --
> Regards
> --
> Kieran