Re: [Query] Preemption (hogging) of the work handler
From: Viresh Kumar
Date: Wed Jul 06 2016 - 14:28:49 EST
On 01-07-16, 12:22, Tejun Heo wrote:
> Hello, Viresh.
>
> On Fri, Jul 01, 2016 at 09:59:59AM -0700, Viresh Kumar wrote:
> > The system watchdog uses a delayed-work (1 second) for petting the
> > watchdog (resetting its counter) and if the work doesn't reset the
> > counters in time (another 1 second), the watchdog resets the system.
> >
> > Petting-time: 1 second
> > Watchdog Reset-time: 2 seconds
> >
> > The wq is allocated with:
> > wdog_wq = alloc_workqueue("wdog", WQ_HIGHPRI, 0);
>
> You probably want WQ_MEM_RECLAIM there to guarantee that it can run
> quickly under memory pressure. In reality, this shouldn't matter too
> much as there aren't many competing highpri work items and thus it's
> likely that there are ready highpri workers waiting for work items.
>
> > The watchdog's work-handler looks like this:
> >
> > static void pet_watchdog_work(struct work_struct *work)
> > {
> >
> > ...
> >
> > pet_watchdog(); //Reset its counters
> >
> > /* CONFIG_HZ=300, queuing for 1 second */
> > queue_delayed_work(wdog_wq, &wdog_dwork, 300);
> > }
> >
> > kernel: 3.10 (Yeah, you can rant at me for that, but it's not something
> > I can decide on :)
>
> Android?
>
> > Symptoms:
> >
> > - The watchdog reboots the system sometimes. It is more reproducible
> > in cases where an (out-of-tree) bus enumerated over USB is suddenly
> > disconnected, which leads to the removal of lots of kernel devices on
> > that bus and a lot of print messages as well, due to failures to send
> > any more data to those devices.
>
> I see.
>
> > Observations:
> >
> > I tried to get more into it and found this..
> >
> > - The timer used by the delayed work fires at the time it was
> > programmed for (checked timer->expires with value of jiffies) and
> > the work-handler gets a chance to run and reset the counters pretty
> > quickly after that.
>
> Hmmm...
>
> > - But somehow, the timer isn't programmed for the right time.
> >
> > - Something is happening between the time the work-handler starts
> > running and we read jiffies from the add_timer() function which gets
> > called from within the queue_delayed_work().
> >
> > - For example, if the value of jiffies in the pet_watchdog_work()
> > handler (before calling queue_delayed_work()) is say 1000000, then
> > the value of jiffies after the call to queue_delayed_work() has
> > returned becomes 1000310. i.e. it sometimes increases by a value of
> > over 300, which is 1 second in our setup. I have seen this delta to
> > vary from 50 to 350. If it crosses 300, the watchdog resets the
> > system (as it was programmed for 2 seconds).
>
> That's weird. Once the work item starts executing, there isn't much
> which can delay it. queue_delayed_work() doesn't even take any lock
> before reading jiffies. In the failing cases, what's jiffies right
> before and after pet_watchdog_work()? Can that take long?
>
> > So, we aren't able to queue the next timer in time and that causes all
> > these problems. I haven't concluded on why that is so yet..
> >
> > Questions:
> >
> > - I hope that the wq handler can be preempted, but can it be this bad?
>
> It doesn't get preempted more than any other kthread w/ -20 nice
> value, so in most systems it shouldn't get preempted at all.
>
> > - Is it fine to use the wq-handler for petting the watchdog? Or should
> > that only be done with help of interrupt-handlers?
>
> It's absolutely fine but you'd probably want WQ_HIGHPRI |
> WQ_MEM_RECLAIM.
>
> > - Any other clues you can give which can help us figure out what's
> > going on?
> >
> > Thanks in advance and sorry to bother you :)
>
> I'd watch sched TPs and see what actually is going on. The described
> scenario should work completely fine.
Okay, so I traced what's going on on the CPU for that long, and now I
know what it is: the *print* messages :)
I enabled traces with '-e all' to look at everything happening on the
CPU.
Following is what starts in the middle of the delayed-work handler:
kworker/0:1H-40 [000] d..1 2994.918766: console: [ 2994.918754] ***
kworker/0:1H-40 [000] d..1 2994.923057: console: [ 2994.918817] ***
kworker/0:1H-40 [000] d..1 2994.935639: console: [ 2994.918842] ***
... (more messages)
kworker/0:1H-40 [000] d..1 2996.264372: console: [ 2994.943074] ***
kworker/0:1H-40 [000] d..1 2996.268622: console: [ 2994.943410] ***
kworker/0:1H-40 [000] d..1 2996.275050: ipi_entry: (Rescheduling interrupts)
(the *** lines are errors specific to our use case and aren't relevant here)
These print messages continue from 2994.918 to 2996.268 (1.35 seconds)
and they hog the work-handler for that long, which results in a
watchdog reboot in our setup. The 3.10 implementation of printk looks
roughly like this (if I am not wrong):
local_irq_save();
flush-console-buffer(); // console_unlock()
local_irq_restore();
So the current CPU tries to flush all the messages from the buffer
before re-enabling interrupts on the local CPU, and as a result I
don't see the hrtimer fire at all for well over a second.
I tried looking at whether something related to this changed between
3.10 and mainline, and found a few patches at least. One of the
important ones is:
commit 5874af2003b1 ("printk: enable interrupts before calling
console_trylock_for_printk()")
I haven't been able to backport it cleanly to 3.10 yet to see if it
makes things work better, but it looks like it was targeting similar
problems.
@Jan Kara, right?
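If I'm reading that commit right, the shape of the change is roughly the
following (a simplified pseudocode sketch from my reading of the patch,
not the literal diff):

```
/* 3.10-era vprintk_emit(), as described above: */
local_irq_save(flags);
log_store(...);                    /* append message to the ring buffer */
if (console_trylock_for_printk())
        console_unlock();          /* flush everything, IRQs still off */
local_irq_restore(flags);

/* After 5874af2003b1: */
local_irq_save(flags);
log_store(...);
local_irq_restore(flags);          /* IRQs back on before flushing */
if (console_trylock_for_printk())
        console_unlock();          /* flush runs with IRQs enabled */
```

With that ordering, the hrtimer for the next pet could still fire while
the console buffer is being drained.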
Any other useful advice/information you guys have? I see Jan, Andrew
and Steven doing a bunch of work in printk.c to make it better, so I'm
Cc'ing them.
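As an aside, for anyone who lands on this thread later: the setup with
both flags Tejun suggests would look roughly like this (an untested
sketch against 3.10-era APIs; wdog_init() is a hypothetical name, and
pet_watchdog() is the platform helper from the snippet earlier in the
thread):

```
static struct workqueue_struct *wdog_wq;
static struct delayed_work wdog_dwork;

static void pet_watchdog_work(struct work_struct *work)
{
        pet_watchdog();                 /* reset the watchdog counters */
        /* re-arm for 1 second; HZ instead of hard-coding 300 */
        queue_delayed_work(wdog_wq, &wdog_dwork, HZ);
}

static int wdog_init(void)
{
        /* WQ_MEM_RECLAIM guarantees a rescuer thread under memory pressure */
        wdog_wq = alloc_workqueue("wdog", WQ_HIGHPRI | WQ_MEM_RECLAIM, 0);
        if (!wdog_wq)
                return -ENOMEM;
        INIT_DELAYED_WORK(&wdog_dwork, pet_watchdog_work);
        queue_delayed_work(wdog_wq, &wdog_dwork, HZ);
        return 0;
}
```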
--
viresh