Re: multi-second application stall in open()

From: Josh Hunt
Date: Thu Jun 21 2012 - 17:28:19 EST


On Thu, Jun 21, 2012 at 3:36 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hey, Vivek.
>
> On Thu, Jun 21, 2012 at 04:32:17PM -0400, Vivek Goyal wrote:
>> Here we deleted queue 20720 and did nothing for .6 seconds and from
>> previous logs it is visible that writes are pending and queued.
>>
>> For some reason cfq_schedule_dispatch() did not lead to kicking queue
>> or queue was kicked but somehow write queue was not selected for
>> dispatch (A case of corrupt data structures?).
>>
>> Are you able to reproduce this issue on latest kernels (3.5-rc2?). I would
>> say put some logs in select_queue() and see where did it bail out. That
>> will confirm that select queue was called and can also give some details
>> why we did not select async queue for dispatch. (Note: select_queue is called
>> multiple times so putting trace point there makes logs very verbose).
>
> Some people are putting in watchdog timers in block layer to kick cfq
> when it stalls with pending requests.  The cfq code there has diverged
> quite a bit from upstream so I have no idea whether it's caused by the
> same issue.  The symptom sounds exactly the same tho.  So, yeah, I
> think it isn't too unlikely that we have a cfq logic bug leading to
> stalls.  :(
>
> --
> tejun
Tejun,

When you say the code has diverged from upstream, do you mean from 3.0
to 3.5? Or maybe I'm misunderstanding what you're getting at. Also, if
you have any links to the watchdog timer code you're referring to, I
would appreciate it.

Thanks
--
Josh