Re: multi-second application stall in open()

From: Josh Hunt
Date: Fri Jun 22 2012 - 16:05:46 EST


On Fri, Jun 22, 2012 at 9:12 AM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> On Thu, Jun 21, 2012 at 04:11:18PM -0500, Josh Hunt wrote:
>
> [..]
>> > say put some logs in select_queue() and see where did it bail out. That
>>
>> Well I did add some instrumentation in select_queue, the "keep_queue
>> st->count:%d, dispatch:%u" line I mentioned above, but I will add more
>> and retest.
>
> Actually before the stall we expired the current queue. That means there
> is no active queue in cfq now. So keep_queue trace will help only if there
> is an active queue and we decide to keep that queue.
>
>  8,0    0        0  4466.139959742     0  m   N cfq20720 del_from_rr
>  8,0    0        0  4466.139963653     0  m   N cfq schedule dispatch
>  8,0    1  1499521  4466.791207736  7570  A   R 7258191 + 8 <- (8,1)
> 7258128
>
> Thing to figure out here is that why cfq is not picking a new queue
> despite the fact there are pending requests and there is no active
> queue (hence cfq is not obiviously not seeing that queue).
>
>>
>> I'm attaching a similar run with no stalls when I set slice_idle to 0.
>
> At this point I am not sure why slice_idle=0 is not seeing this issue. It
> does not see to be directly related to idling. Because when stall happens
> CFQ is not idling on anything. It has expired active queue and for some
> reason it thinks that I have no more requests to dispatch, so I have
> nothing to do until a new request comes in.
>
> Thanks
> Vivek

Got some more info. It actually looks like every idle timer fire there
is some sort of stall:

8,0 0 0 4807.015613266 0 m N cfq idle timer fired
8,0 0 0 4807.015616897 0 m N cfq7646
idle_slice_timer: active queue
8,0 0 0 4807.015618853 0 m N cfq7646
idle_slice_timer: clear_cfqq_deep
8,0 0 0 4807.015620529 0 m N cfq7646 slice expired t=0
8,0 0 0 4807.015622205 0 m N cfq7646 del timer
8,0 0 0 4807.015623323 0 m N cfq7646 update idle time stats
8,0 0 0 4807.015626396 0 m N cfq7646 sl_used=8
disp=1 charge=8 iops=0 sect=8
8,0 0 0 4807.015627793 0 m N cfq7646 slice_expired
del cfqq from rr
8,0 0 0 4807.015628910 0 m N cfq7646 del_from_rr
8,0 0 0 4807.015630307 0 m N cfq7646 slice_expired
resort_rr_list
8,0 0 0 4807.015631983 0 m N cfq7646 slice_expired
setting active_queue to NULL
8,0 0 0 4807.015633380 0 m N cfq7646 slice_expired
setting active_cic to NULL
8,0 0 438262 4807.717719091 6076 A W 1639679 + 8 <- (8,1) 1639616
8,0 0 438263 4807.717720767 6076 Q W 1639679 + 8 [flush-8:0]
8,0 0 438264 4807.717727192 6076 G W 1639679 + 8 [flush-8:0]
8,0 0 438265 4807.717730265 6076 P N [flush-8:0]

You can see in the attached log that it happens every timer fire to some extent.

We get to here in the code:
static void cfq_idle_slice_timer(unsigned long data)
{
...
/*
* Queue depth flag is reset only when the idle didn't succeed
*/
cfq_clear_cfqq_deep(cfqq);
cfq_log_cfqq(cfqd, cfqq, "idle_slice_timer: clear_cfqq_deep");
}
expire:
cfq_slice_expired(cfqd, timed_out);
out_kick:
cfq_schedule_dispatch(cfqd);
out_cont:
spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
}

and from the traces we are going through cfq_slice_expired() and then
cfq_schedule_dispatch(), but the schedule_dispatch print is not there,
which means cfqd->busy_queues is false.

static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
{
if (cfqd->busy_queues) {
cfq_log(cfqd, "schedule dispatch");
kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
}
}

I'm still going through the traces, but figured I'd pass along this
info for now. If we're stalling after every timer fire here, then it's
understandable why I'm not seeing the stalls when I set slice_idle to
0.

Will update if I have more. Let me know if there's something else
you'd like me to instrument.

--
Josh

Attachment: sda.parsed-moreverbose2.bz2
Description: BZip2 compressed data