Re: [PATCH printk v5 1/1] printk: extend console_lock for per-console locking

From: Petr Mladek
Date: Tue May 03 2022 - 02:49:14 EST


On Tue 2022-05-03 01:13:19, Marek Szyprowski wrote:
> Hi Petr,
>
> On 02.05.2022 15:17, Petr Mladek wrote:
> > On Mon 2022-05-02 11:19:07, Marek Szyprowski wrote:
> >> On 30.04.2022 18:00, John Ogness wrote:
> >>> On 2022-04-29, Marek Szyprowski <m.szyprowski@xxxxxxxxxxx> wrote:
> >>>> The same issue happens if I boot with init=/bin/bash
> >>> Very interesting. Since you are seeing all the output up until you try
> >>> doing something, I guess receiving UART data is triggering the issue.
> >> Right, this is how it looks like.
> >>
> >>>> I found something really interesting. When lockup happens, I'm still
> >>>> able to log via ssh and trigger any magic sysrq action via
> >>>> /proc/sysrq-trigger.
> >>> If you boot the system and directly login via ssh (without sending any
> >>> data via serial), can you trigger printk output on the UART? For
> >>> example, with:
> >>>
> >>> echo hello > /dev/kmsg
> >>>
> >>> (You might need to increase your loglevel to see it.)
> >> Data written to /dev/kmsg and all kernel logs were always displayed
> >> correctly. Also data written directly to /dev/ttyAML0 is displayed
> >> properly on the console. The latter doesn't however trigger the input
> >> related activity.
> >>
> >> It looks that the data read from the uart is delivered only if other
> >> activity happens on the kernel console. If I type 'reboot' and press
> >> enter, nothing happens immediately. If I type 'date >/dev/ttyAML0' via
> >> ssh then, I only see the date printed on the console. However if I type
> >> 'date >/dev/kmsg', the the date is printed and reboot happens.
> > This is really interesting.
> >
> > 'date >/dev/kmsg' should be handled like a normal printk().
> > It should get pushed to the console using printk kthread,
> > that calls call_console_driver() that calls con->write()
> > callback. In our case, it should be meson_serial_console_write().
> >
> > I am not sure if meson_serial_console_write() is used also
> > when writing via /dev/ttyAML0.
> >
> >>>> It turned out that the UART console is somehow blocked, but it
> >>>> receives and buffers all the input. For example after issuing "echo
> >>>> >/proc/sysrq-trigger" from the ssh console, the UART console has been
> >>>> updated and I see the magic sysrq banner and then all the commands I
> >>>> blindly typed in the UART console! However this doesn't unblock the
> >>>> console.
> >>> sysrq falls back to direct printing. This would imply that the kthread
> >>> printer is somehow unable to print.
> >>>
> >>>> Here is the output of 't' magic sys request:
> >>>>
> >>>> https://protect2.fireeye.com/v1/url?k=8649f24d-e73258c4-86487902-74fe48600034-a2ca6bb18361467d&q=1&e=1bc4226f-a422-42b9-95e8-128845b8609f&u=https%3A%2F%2Fpastebin.com%2FfjbRuy4f
> >>> It looks like the call trace for the printing kthread (pr/ttyAML0) is
> >>> corrupt.
> >> Right, good catch. For comparison, here is a 't' sysrq result from the
> >> 'working' serial console (next-20220429), which happens usually 1 of 4
> >> boots:
> >>
> >> https://protect2.fireeye.com/v1/url?k=610106f1-008a13b6-61008dbe-000babff99aa-11083c39c44861df&q=1&e=2eafad9e-c5d2-4696-9d78-f3b5885256dc&u=https%3A%2F%2Fpastebin.com%2Fmp8zGFbW
> > Strange. The backtrace is weird here too:
> >
> > [ 50.514509] task:pr/ttyAML0 state:R running task stack: 0 pid: 65 ppid: 2 flags:0x00000008
> > [ 50.514540] Call trace:
> > [ 50.514548] __switch_to+0xe8/0x160
> > [ 50.514561] meson_serial_console+0x78/0x118
> >
> > There should be kthread() and printk_kthread_func() on the stack.
> >
> > Hmm, meson_serial_console+0x78/0x118 is weird on its own.
> > meson_serial_console is the name of the structure. I would
> > expect a name of the .write callback, like
> > meson_serial_console_write()
>
> Well, this sounds a bit like a stack corruption. For the reference, I've
> checked what magic sysrq 't' reports for my other test boards:
>
> RaspberryPi4:
>
> [  166.702431] task:pr/ttyS0        state:R stack:    0 pid:   64
> ppid:     2 flags:0x00000008
> [  166.711069] Call trace:
> [  166.713647]  __switch_to+0xe8/0x160
> [  166.717216]  __schedule+0x2f4/0x9f0
> [  166.720862]  log_wait+0x0/0x50
> [  166.724081] task:vfio-irqfd-clea state:I stack:    0 pid:   65
> ppid:     2 flags:0x00000008
> [  166.732698] Call trace:
>
>
> ARM Juno R1:
>
> [   74.356562] task:pr/ttyAMA0      state:R  running task stack:    0
> pid:   47 ppid:     2 flags:0x00000008
> [   74.356605] Call trace:
> [   74.356617]  __switch_to+0xe8/0x160
> [   74.356637]  amba_console+0x78/0x118
> [   74.356657] task:kworker/2:1     state:I stack:    0 pid:   48
> ppid:     2 flags:0x00000008
> [   74.356695] Workqueue:  0x0 (mm_percpu_wq)
> [   74.356738] Call trace:
>
>
> QEMU virt/arm64:
>
> [  174.155760] task:pr/ttyAMA0      state:S stack:    0 pid:   26
> ppid:     2 flags:0x00000008
> [  174.156305] Call trace:
> [  174.156529]  __switch_to+0xe8/0x160
> [  174.157131]  0xffff5ebbbfdd62d8

You mentioned in the other mail that the other boards work as
expected. I mean that console gets stuck only on the meson board.
Is it true, please?

The stack looks really weird. But another weird thing is that
even the meson board is able to show the messages, for example,
using echo hello >/dev/kmsg. It suggests that the kthreads
somehow work.

There is also a possibility that this code path is optimized
some special way and the unwinder has troubles to show
the stack correctly.


> In the last case it doesn't happen always. In the other runs I got the
> following log from QEMU virt/arm64:
>
> [  200.537579] task:pr/ttyAMA0      state:S stack:    0 pid:   26
> ppid:     2 flags:0x00000008
> [  200.538121] Call trace:
> [  200.538341]  __switch_to+0xe8/0x160
> [  200.538583]  __schedule+0x2f4/0x9f0
> [  200.538822]  schedule+0x54/0xd0
> [  200.539047]  printk_kthread_func+0x2d8/0x3bc
> [  200.539301]  kthread+0x118/0x11c
> [  200.539523]  ret_from_fork+0x10/0x20

This is what I would expect when the kthread is in an interruptible
sleep waiting for new strings to handle.

BTW: This is what I see on my x86_64 test system:

[61892.932242] task:pr/tty0 state:S stack: 0 pid: 14 ppid: 2 flags:0x00004000
[61892.932250] Call Trace:
[61892.932253] <TASK>
[61892.932263] __schedule+0x376/0xbb0
[61892.932284] schedule+0x44/0xb0
[61892.932290] printk_kthread_func+0x18f/0x370
[61892.932303] ? schedstat_stop+0x10/0x10
[61892.932316] ? console_start+0x30/0x30
[61892.932322] kthread+0xf2/0x120
[61892.932327] ? kthread_complete_and_exit+0x20/0x20
[61892.932338] ret_from_fork+0x1f/0x30
[61892.932370] </TASK>
[61892.932373] task:pr/ttyS0 state:R running task stack: 0 pid: 15 ppid: 2 flags:0x00004000
[61892.932391] Call Trace:
[61892.932398] <TASK>
[61892.932427] ? printk_kthread_func+0x15b/0x370
[61892.932436] ? schedstat_stop+0x10/0x10
[61892.932449] ? console_start+0x30/0x30
[61892.932455] ? kthread+0xf2/0x120
[61892.932460] ? kthread_complete_and_exit+0x20/0x20
[61892.932471] ? ret_from_fork+0x1f/0x30
[61892.932502] </TASK>


pr/tty0 is in the interruptible sleep and the stack looks reasonable.

pr/ttyS0 is in runnable state and the stack is weird. The '?' means
that this address was found on the stack, it belongs to some
function but the unwinder is not able to assign it to the current
call path by going back via the return addresses stored on the stack.


> I hope that at least the qemu case will let you to analyze it by
> yourself. I run my test system with the following command:
>
> qemu-system-aarch64 -kernel virt/Image -append "console=ttyAMA0
> root=/dev/vda rootwait" -M virt -cpu cortex-a57 -smp 2 -m 1024 -device
> virtio-blk-device,drive=virtio-blk0 -drive
> file=qemu-virt-rootfs.raw,id=virtio-blk0,if=none,format=raw -netdev
> user,id=user -device virtio-net-device,netdev=user -display none

Thanks a lot for all the information. It is really helpful.

Best Regards,
Petr