Re: ast: cursor flashing softlockups

From: David Daney
Date: Tue May 17 2016 - 14:07:56 EST


On 05/17/2016 10:39 AM, David Daney wrote:
I can confirm this.

I have a patch that I think is the proper fix. You should see it soon (after I test it a bit more)

Thanks,
David Daney


[...]


On Tue, May 17, 2016 at 4:07 AM, Dann Frazier
<dann.frazier@xxxxxxxxxxxxx> wrote:
Hi,
I'm observing a soft lockup issue w/ the ASPEED controller on an
arm64 server platform. This was originally seen on Ubuntu's 4.4
kernel, but it is reproducible w/ vanilla 4.6-rc7 as well.

[ 32.792656] NMI watchdog: BUG: soft lockup - CPU#38 stuck for 22s!
[swapper/38:0]

I observe this just once each time I boot into debian-installer (I'm
using a serial console, but the ast module gets loaded during
startup).

I have figured out that it is caused by 'mod_timer(timer, jiffies)' and
'ops->cur_blink_jiffies' is observed as zero in cursor_timer_handler()
when the issue happened.

Thanks for tracking this down.

This softlockup looks to be caused by:

commit 27a4c827c34ac4256a190cc9d24607f953c1c459
Author: Scot Doyle <lkml14@xxxxxxxxxxxxx>
Date: Thu Mar 26 13:56:38 2015 +0000

fbcon: use the cursor blink interval provided by vt

vt now provides a cursor blink interval via vc_data. Use this
interval instead of the currently hardcoded 200 msecs. Store
it in
fbcon_ops to avoid locking the console in cursor_timer_handler().

Signed-off-by: Scot Doyle <lkml14@xxxxxxxxxxxxx>
Acked-by: Pavel Machek <pavel@xxxxxx>
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>

and

commit bd63364caa8df38bad2b25b11b2a1b849475cce5
Author: Scot Doyle <lkml14@xxxxxxxxxxxxx>
Date: Thu Mar 26 13:54:39 2015 +0000

vt: add cursor blink interval escape sequence

Add an escape sequence to specify the current console's cursor
blink
interval. The interval is specified as a number of
milliseconds until
the next cursor display state toggle, from 50 to 65535.
/proc/loadavg
did not show a difference with a one msec interval, but the lower
bound is set to 50 msecs since slower hardware wasn't tested.

Store the interval in the vc_data structure for later access
by fbcon,
initializing the value to fbcon's current hardcoded value of
200 msecs.

Signed-off-by: Scot Doyle <lkml14@xxxxxxxxxxxxx>
Acked-by: Pavel Machek <pavel@xxxxxx>
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>



Looks it is a real fbcon/vt issue, see following:

fbcon_init()
<-.con_init
<-visual_init()

reset_terminal()
<-vc_init()

vc->vc_cur_blink_ms is just set in reset_terminal() from vc_init() path,
and ops->cur_blink_jiffies is figured out from vc->vc_cur_blink_ms
in fbcon_init().

And visual_init() is always run before vc_init(), so
ops->cur_blink_jiffies
is initialized as zero and cause the soft lockup issue finally.

Thanks,
Ming


perf shows that the CPU caught by the NMI is typically in code
updating the cursor timer:

- 16.92% swapper [kernel.kallsyms] [k]
_raw_spin_unlock_irqrestore
- _raw_spin_unlock_irqrestore
+ 16.87% mod_timer
+ 0.05% cursor_timer_handler
- 12.15% swapper [kernel.kallsyms] [k] queue_work_on
- queue_work_on
+ 12.00% cursor_timer_handler
+ 0.15% call_timer_fn
+ 10.98% swapper [kernel.kallsyms] [k] run_timer_softirq
- 2.23% swapper [kernel.kallsyms] [k] mod_timer
- mod_timer
+ 1.97% cursor_timer_handler
+ 0.26% call_timer_fn

During the same period, I can see that another CPU is actively
executing the timer function:

- 42.18% kworker/u96:2 [kernel.kallsyms] [k] ww_mutex_unlock
- ww_mutex_unlock
- 40.70% ast_dirty_update
ast_imageblit
soft_cursor
bit_cursor
fb_flashcursor
process_one_work
worker_thread
kthread
ret_from_fork
+ 1.48% ast_imageblit
- 40.15% kworker/u96:2 [kernel.kallsyms] [k] __memcpy_toio
- __memcpy_toio
+ 31.54% ast_dirty_update
+ 8.61% ast_imageblit

Using the graph function tracer on fb_flashcursor(), I see that
ast_dirty_update usually takes around 60 us, in which it makes 16
calls to __memcpy_toio(). However, there is always one instance on
every boot of the installer where ast_dirty_update() takes ~98 *ms* to
complete, during which it makes 743 calls to __memcpy_toio(). While
that doesn't directly account for the full 22s, I do wonder if that
maybe a smoking gun.

fyi, this is being tracked at: https://bugs.launchpad.net/bugs/1574814

-dann