Re: __i915_spin_request() sucks

From: Jens Axboe
Date: Fri Nov 13 2015 - 10:12:14 EST


On 11/13/2015 02:15 AM, Chris Wilson wrote:
On Thu, Nov 12, 2015 at 03:52:02PM -0700, Jens Axboe wrote:
On 11/12/2015 03:19 PM, Chris Wilson wrote:
So today, I figured I'd try just killing that spin. If it fails, we'll
punt to normal completions, so easy change. And wow, MASSIVE difference.
I can now scroll in chrome and not rage! It's like the laptop is 10x
faster now.

Ran git blame, and found:

commit 2def4ad99befa25775dd2f714fdd4d92faec6e34
Author: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Date: Tue Apr 7 16:20:41 2015 +0100

drm/i915: Optimistically spin for the request completion

and read the commit message. Doesn't sound that impressive. Especially
not for something that screws up interactive performance by a LOT.

What's the deal? Revert?

The tests that it improved the most were the latency sensitive tests and
since my Broadwell xps13 behaves itself, I'd like to understand how it
culminates in an interactivity loss.

1. Maybe it is the uninterruptible nature of the polling, making X's
SIGIO jerky:

This one still feels bad.

2. Or maybe it is increased mutex contention:

And so does this one... I had to manually apply hunks 2-3, and after
doing seat-of-the-pants testing for both variants, I confirmed with
perf that we're still seeing a ton of time in __i915_wait_request()
for both of them.

Or maybe it is an indirect effect, such as power balancing between the
CPU and GPU, or just thermal throttling, or it may be the task being
penalised for consuming its timeslice (to which any completion polling
seems susceptible).

Look, polls in the 1-10ms range are just insane. Either you botched
the commit message and really meant "~1ms at most", in which case
I'd suspect you of smoking something good, or you hacked it up wrong
and used jiffies when you really wanted some other time check that
actually did give you 1us.

What other time check? I was under the impression that setting up an
hrtimer was expensive and that jiffies was readily available.

Looping for 10ms is a lot more expensive :-). jiffies is always there, but it's WAY too coarse to be used for something like this.

You could use ktime_get(), there's a lot of helpers for checking time_after, adding msecs, etc. Something like the below, not tested here
yet.

I'll take an IRQ over 10 msecs of busy looping on my laptop, thanks.

"Limit the spinning to a single jiffie (~1us) at most"

is totally wrong. I have HZ=100 on my laptop. That's 10ms. 10ms!
Even if I had HZ=1000, that'd still be 1ms of spinning. That's
seriously screwed up, guys.

That's over and above the termination condition for blk_poll().

?! And this is related, how? Comparing apples and oranges. One is a
test opt-in feature for experimentation, the other is
unconditionally enabled for everyone. I believe the commit even says
so. See the difference? Would I use busy loop spinning waiting for
rotating storage completions, which are in the 1-10ms range? No,
with the reason being that the potential wins for spins are in the
usec range.

Equally I expect the service interval for a batch to be around 2-20us.
There are many workloads that execute 30-50k requests/s, and you can
appreciate that they are sensitive to the latency in setting up an irq
and receiving it - just as equally leaving that irq enabled saturates a
CPU with simply executing the irq handler. So what mechanism do you use
to guard against either a very long queue depth or an abnormal request
causing msec+ spins?

Not disputing that polling can work, but it needs to be a bit more clever. Do you know which requests are fast and which ones are not? Could you track it? Should we make this a module option?

20usec is too long to poll. If we look at the wins of polling, we're
talking anywhere from 1-2 usec to maybe 5 usec, depending on different
factors. So spinning between 1-3 usec should be a hard limit on most
platforms. And it's somewhat of a policy decision, since it does involve
throwing CPU at the problem. There's a crossover point where below it's
always a win, but that needs a lot more work than just optimistic
spinning for everything. You also do need a check for whether the task
has been woken up, that's also missing.

As for interrupt mitigation, I'd consider that a separate problem. It's
a lot simpler than the app-induced polling that i915 is doing here. So
if overloading a core with IRQs is an issue, I'd solve that separately,
similar to NAPI or blk-iopoll (not to be confused with the blk_poll()
you referenced, which is app-induced polling).

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 5cf4a1998273..658514e899b1 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1148,17 +1148,18 @@ static bool missed_irq(struct drm_i915_private *dev_priv,

 static int __i915_spin_request(struct drm_i915_gem_request *req)
 {
-	unsigned long timeout;
+	ktime_t start, end;

 	if (i915_gem_request_get_ring(req)->irq_refcount)
 		return -EBUSY;

-	timeout = jiffies + 1;
+	start = ktime_get();
+	end = ktime_add_us(start, 1);
 	while (!need_resched()) {
 		if (i915_gem_request_completed(req, true))
 			return 0;

-		if (time_after_eq(jiffies, timeout))
+		if (ktime_after(ktime_get(), end))
 			break;

 		cpu_relax_lowlatency();

--
Jens Axboe
