Re: x264 benchmarks BFS vs CFS

From: Jason Garrett-Glaser
Date: Fri Dec 18 2009 - 05:12:43 EST


On Thu, Dec 17, 2009 at 11:30 PM, Mike Galbraith <efault@xxxxxx> wrote:
> On Fri, 2009-12-18 at 06:23 +0100, Ingo Molnar wrote:
>
>> Having said that, we generally try to make things perform well without apps
>> having to switch themselves to SCHED_BATCH. Mike, do you think we can make
>> x264 perform as well (or nearly as well) under SCHED_OTHER as under
>> SCHED_BATCH?
>
> It's not bad as is, except for ultrafast mode.  START_DEBIT is the
> biggest problem there.  I don't think SCHED_OTHER will ever match
> SCHED_BATCH for this load, though I must say I haven't full-spectrum
> tested.  This load really wants RR scheduling, and wakeup preemption
> necessarily perturbs run order.
>
> I'll probably piddle with it some more, it's an interesting load.
>
>        -Mike
>
>

Two more thoughts here:

1) We're considering moving to a thread pool soon; we already have a
working patch for it and if anything it'll save a few clocks spent on
nice()ing threads and other such things. Will this improve
START_DEBIT at all? I've attached the beta patch if you want to try
it. Note this works with 2) as well, so it adds yet another
dimension to what's mentioned below.
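
For anyone who wants the idea without reading the patch, here is a
rough sketch of what a thread pool buys us. It is not the attached
patch; the ThreadPool class and its method names are made up for
illustration, and it is in Python rather than C for brevity. The only
point is that the workers are created once and then reused for every
batch of work, instead of being spawned and torn down each time.

```python
import queue
import threading


class ThreadPool:
    """Hypothetical minimal pool: workers are created once and reused."""

    def __init__(self, n_workers):
        self.jobs = queue.Queue()
        self.workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(n_workers)
        ]
        for w in self.workers:
            w.start()

    def _run(self):
        while True:
            job = self.jobs.get()
            try:
                if job is None:        # sentinel: shut this worker down
                    return
                func, item, out, idx = job
                out[idx] = func(item)  # each job writes its own result slot
            finally:
                self.jobs.task_done()

    def map(self, func, items):
        """Run func on every item using the existing workers, in order."""
        out = [None] * len(items)
        for i, item in enumerate(items):
            self.jobs.put((func, item, out, i))
        self.jobs.join()               # block until every queued job is done
        return out

    def close(self):
        for _ in self.workers:
            self.jobs.put(None)
        for w in self.workers:
            w.join()
```

Calling pool.map() repeatedly reuses the same threads, so the
per-batch cost is a queue push and a wakeup rather than a thread
creation. A job that raises will kill its worker in this sketch; a
real pool would have to handle that.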

2) We recently implemented a new threading model which may be
interesting to test as well. This threading model gives worse
compression *and* performance, but has one benefit: it adds zero
latency, whereas normal threading adds a full frame of latency per
thread. This was paid for by a company interested in
ultra-low-latency streaming applications, where 1 millisecond is a
huge deal. I've been thinking this might be interesting to bench from
a kernel perspective as well, as when you're spawning a half-dozen
threads and need them all done within 6 milliseconds, you start
getting down to serious scheduler issues.

The new threading model is much less complex than the regular one and
works as follows. The frame is split into X slices, and each slice is
encoded with one thread. Specifically, it works via the following
process:

1. Preprocess input frame, perform lookahead analysis on input frame
(all singlethreaded)
2. Spawn a ton of threads to do the main encode, one per slice.
3. Join all the threads.
4. Do post-filtering on the output frame, return.
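
The four steps above can be sketched roughly like this. It is not
x264 code: preprocess(), encode_slice(), postfilter() and split_rows()
are hypothetical stand-ins, a "frame" is just a list of rows of
numbers, and it is Python rather than C. What it does show is the
shape of the per-frame work: a serial part, a burst of short-lived
threads that all have to finish, then another serial part.

```python
import threading


def preprocess(frame):
    # Stand-in for step 1 (preprocessing and lookahead, single-threaded).
    return list(frame)


def encode_slice(rows):
    # Stand-in for the real per-slice encode: one value per row.
    return [sum(row) for row in rows]


def postfilter(encoded):
    # Stand-in for step 4 (post-filtering, single-threaded).
    return encoded


def split_rows(rows, n_slices):
    """Split rows into n_slices contiguous, nearly equal slices."""
    base, extra = divmod(len(rows), n_slices)
    slices, start = [], 0
    for i in range(n_slices):
        end = start + base + (1 if i < extra else 0)
        slices.append(rows[start:end])
        start = end
    return slices


def encode_frame_sliced(frame, n_slices):
    rows = preprocess(frame)                      # step 1
    slices = split_rows(rows, n_slices)
    results = [None] * n_slices

    def worker(i):
        results[i] = encode_slice(slices[i])

    # Step 2: one freshly spawned thread per slice, for every frame.
    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_slices)]
    for t in threads:
        t.start()
    # Step 3: the frame is not done until the slowest slice is done.
    for t in threads:
        t.join()
    # Step 4: stitch slices back together in order and post-filter.
    return postfilter([v for part in results for v in part])
```

Since the join waits for every slice, the frame's latency is set by
whichever thread the scheduler gets to last.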

Clearly this is an utter disaster, since it spawns N times as many
threads as the old threading model *and* they are far shorter-lived, *and*
only part of the application is multithreaded. But there's not really
a better way to do low-latency threading, and it's an interesting
challenge to boot. IIRC, it's also the way ffmpeg's encoder threading
works. It's widely considered an inferior model, but as mentioned
before, in this particular use-case there's no choice.

To enable this, use --sliced-threads. I'd recommend using a
higher-resolution clip for this, as it performs atrociously on
very low resolution videos for reasons you might be able to guess. If
you need a higher-res clip, check the SD or HD ones here:
http://media.xiph.org/video/derf/ .

I'm personally curious as to what kind of scheduler issues this
results in--I haven't done any BFS vs CFS tests with this option
enabled yet.

Jason

Attachment: thread_pool_slices.diff
Description: Binary data