Re: power-off delay/hang due to commit 6d25be57 (mainline)

From: Stephen Berman
Date: Thu May 14 2020 - 17:39:59 EST


On Thu, 14 May 2020 00:04:28 +0200 Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> wrote:

> On 2020-05-08 23:30:45 [+0200], Stephen Berman wrote:
>> > Can you log the output on the serial console?
>>
>> How do I do that?
>
> The spec for your mainboard says "serial port header". You would need to
> connect a cable there to another computer and log its output.
> The alternative would be to delay the output on the console and use a
> camera.

It's easiest for me to take a picture, since there isn't much output and
in any case the delay happens on its own ;-). I'm sending you the
image (from kernel 5.6.4) off-list since, even after reducing it, it's
still 1.2 MB.

>> > If the commit you cited is really the problem then it would mean that a
>> > worker isn't scheduled for some reason. Could you please enable
>> > CONFIG_WQ_WATCHDOG to see if workqueue core code notices that a worker
>> > isn't making progress?

I enabled that and also CONFIG_SOFTLOCKUP_DETECTOR,
CONFIG_HARDLOCKUP_DETECTOR and CONFIG_DETECT_HUNG_TASK, which had all
been unset previously.

>> How will I know if that happens, is there a specific message in the tty?
>
> On the tty console where you see the "timing out command, waited"
> message, there should be something starting with
> |BUG: workqueue lockup - pool
>
> followed by information about the pool that got stuck. That code checks
> the workqueues every 30secs by default. So if you waited >= 60secs then
> the system is not detecting a stall.

As you can see in the photo, there was no message about a workqueue
lockup, only "task halt:5320 blocked for more than <XXX> seconds" every
two minutes. I suppose that comes from one of the other options I
enabled. Does it reveal anything about the problem?
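
In case a text log becomes available later (e.g. captured over the serial
header instead of a photo), here is a minimal Python sketch that counts
the three kinds of messages mentioned in this thread. The function names
are made up for illustration, and the patterns are just the strings quoted
above; I am assuming the wording is the same in the real log, and this is
not a complete list of what the kernel may print.

```python
import re

# Patterns based only on the messages quoted in this thread (assumption:
# the exact wording may differ between kernel versions).
PATTERNS = {
    "workqueue_lockup": re.compile(r"BUG: workqueue lockup - pool"),
    "hung_task": re.compile(r"task (\S+):(\d+) blocked for more than (\d+) seconds"),
    "scsi_timeout": re.compile(r"timing out command, waited"),
}


def classify(line):
    """Return the name of the first pattern found in the line, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines match each known message type."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        name = classify(line)
        if name is not None:
            counts[name] += 1
    return counts
```

Feeding it the lines of a saved console log, e.g.
summarize(open("console.log")), would return one count per message type;
a zero for "workqueue_lockup" next to non-zero "hung_task" and
"scsi_timeout" counts would match what the photo shows. The file name
here is hypothetical.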

> As far as I can tell, there is nothing special on your system. The CD
> and disk drives are served by the AHCI controller. There is no special
> SCSI/SATA/SAS controller.
> Right now I have no idea how the workqueues fit in the picture. Could
> you please check if the stall-detector says something?

Is that the message I repeated above or do you mean the workqueue?

> Is it possible to show me output when the timeout message comes? My
> guess is that the system is going down and before unmounting/remounting RO
> the filesystem it flushes its last data. But this is done before issuing
> the "halt-syscall".

The entire output from `shutdown -h now' is in the picture; after the
fourth "timing out command" message, I pressed the reset button.

Steve Berman