Re: [PATCH 2/2] drm/i915: Limit the busy wait on requests to 2us not 10ms!

From: Tvrtko Ursulin
Date: Mon Nov 16 2015 - 07:08:17 EST

On 16/11/15 11:12, Chris Wilson wrote:
On Mon, Nov 16, 2015 at 10:24:45AM +0000, Tvrtko Ursulin wrote:


On 15/11/15 13:32, Chris Wilson wrote:
When waiting for high frequency requests, the finite amount of time
required to set up the irq and wait upon it limits the response rate. By
busywaiting on the request completion for a short while we can service
the high frequency waits as quick as possible. However, if it is a slow
request, we want to sleep as quickly as possible. The tradeoff between
waiting and sleeping is roughly the time it takes to sleep on a request,
on the order of a microsecond. Based on measurements from big core, I
have set the limit for busywaiting as 2 microseconds.

Sounds like solid reasoning. Would it also be worth finding the
trade off limit for small core?

Takes a bit longer, but 2us seems "ok" on PineView (as in it doesn't
lose the boost from spinning rather than sleeping). Have some more
testing to do on vlv/byt.


The code currently uses the jiffie clock, but that is far too coarse (on
the order of 10 milliseconds) and results in poor interactivity as the
CPU ends up being hogged by slow requests. To get microsecond resolution
we need to use a high resolution timer. The cheapest of which is polling
local_clock(), but that is only valid on the same CPU. If we switch CPUs
because the task was preempted, we can also use that as an indicator that
the system is too busy to waste cycles on spinning and we should sleep

Hm, need_resched would not cover the CPU switch anyway? Or maybe
need_resched means something other than I thought which is "there
are other runnable tasks"?

As I understand it, it means that the scheduler tick fired (or something
else yielded). I haven't spotted if it gets set as the runqueue changes.
As it pertains to us, it means that we need to get to schedule() as
quick as possible which along this path means going to sleep.

I'm not sure if need_resched() would catch the cpu switch, if we were
preempted as the flag would be set on the idle process not us.

Could be, I wasn't sure at all, just curious and trying to understand it fully. Cpu check is so cheap as implemented that it is not under any criticism.

This would also have impact on the patch subject line.I thought we
would burn a jiffie of CPU cycles only if there are no other
runnable tasks - so how come an impact on interactivity?

I have a couple of ideas for the effect on interactivty:

1. Burning through the time slice is acting as a penalty against running
that process (typically the compositor) in the near future, perhaps
enough to miss some deadlines.

2. Processor power balancing.

Also again I think the commit message needs some data on how this
was found and what is the impact.

The system felt unresponsive. It would be interesting for me to know a
few more details about the tick on that system (HZ, tickless?,
preemption?) to see if changing the config on my xps13 also produces the
lag/jitter/poor interactivty.

Yes interesting but not critical I think. Since the new scheme looks as efficient as the old one so there should be no downside anyway.

Btw as it happens, just last week as I was playing with perf, I did
notice busy spinning is the top cycle waster in some benchmarks. I
was in the process of trying to quantize the difference with it on
or off but did not complete it.

I found that spin-request appearing in profiles makes tracking down the
culprit higer in the stack much easier. Otherwise you have to remember to
enable a pass with the tracepoint to find the stalls (or use
intel-gpu-overlay which does it for you).

I'll put it on my TODO list of things to play with.

+static u64 local_clock_us(unsigned *cpu)
+ u64 t;
+ *cpu = get_cpu();
+ t = local_clock() >> 10;

Needs comment I think to explicitly mention the approximation, or
maybe drop the _us suffix?

I did consider _approx_us but thought that was overkill. A comment along
the lines of
/* Approximately convert ns to us - the error is less than the
* truncation!

And the result is not used in subsequent calculations apart from comparing against an approximate timeout?

@@ -1161,7 +1183,7 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
if (signal_pending_state(state, current))

- if (time_after_eq(jiffies, timeout))
+ if (busywait_stop(timeout, cpu))


Otherwise looks good. Not sure what would you convert to 32-bit from
your follow up reply since you need us resolution?

s/u64/unsigned long/ s/time_after64/time_after/

32bits of us resolution gives us 1000s before wraparound between the two
samples. And I hope that a 1000s doesn't pass between loops. Or if it does,
the GPU managed to complete its task.

Now I see that you did say low bits.. yes that sounds fine.

Btw while you are optimizing things maybe pick up this micro optimization:

Not in scope of this thread but under the normal development patch flow.

Btw2, any benchmark result changes with this?


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at