Re: power-off delay/hang due to commit 6d25be57 (mainline)

From: Stephen Berman
Date: Fri May 08 2020 - 17:31:32 EST


On Wed, 6 May 2020 23:57:13 +0200 Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> wrote:

> On 2020-05-01 17:46:48 [+0200], Stephen Berman wrote:
>> I'm experiencing a delay or hang in powering off my computer after `halt
>> -d -f -i -p' and I've bisected it to this commit in the mainline tree:
>
> You refer to a normal "poweroff" or is this some kind of "shutdown now"
> kind of thing? Unless I'm mistaken, the `halt' command above will turn
> off the machine without stopping user tasks (as in "reaching shutdown
> level").

What I actually do is `shutdown -h now' but my understanding is that (at
least in LFS) the last thing that happens when `shutdown -h now' runs is
effectively the same as `halt -d -f -i -p'. The reason I wrote that
above is because I did not start timing immediately when I entered
`shutdown -h now', since it displays messages of the actions it takes,
like unmounting filesystems and so on. I start timing after the last
message before powering off (or before the hang/delay), which in LFS is
"Bringing down the loopback interface". When powering off succeeds
normally, it takes ~4 seconds after that message appears. I've since
tested directly entering `halt -d -f -i -p' with kernel 5.1.0: there
were no messages like with `shutdown -h now' and after ~4 seconds, the
machine powered off, which seems to confirm my understanding. I also
tested `halt -d -f -i -p' with kernel 5.6.4, and here, too, there were
no messages like with `shutdown -h now' but after ~7 seconds this
appeared in the tty:

sd 4:0:0:0: [sda] tag#13 timing out command, waited 7s

and after two minutes, this:

sr 5:0:0:0: tag#14 timing out command, waited 120s

and then I pressed the reset button.

> â
>> The delays have varied in length from ~20 seconds to seeminingly
>> indefinitely long (the longest I've waited before pressing the start
>> button on the machine is 6 minutes). With kernels prior to this commit,
>> my machine powers off within 4 seconds after the halt invocation.
>
> So you say that normally the machine shuts down immediately but now it
> takes 20secs+ if it shuts down at all?

Yes.

> â
>> sr 5:0:0:0: tag#21 timing out command, waited 120s
>
> The CD-drive is polled regularly so it notices when a CD us inserted /
> removed. It seems that one of the requests wasn't answered and it ended
> up in timeout.

But the drive is always not in use and empty when I run `shutdown -h
now' (in fact I rarely use that drive at all). And the message I
referred to above "sd 4:0:0:0: [sda] tag#13 timing out command, waited
7s" seems to indicate that other polling requests also aren't answered.

> â
>> minutes. But after running emacs in a tty (and then killing it), the
>> delay was seemingly indefinite, as it is with my usual workflow with X,
>> emacs, firefox, etc. running (but I always kill them before invoking
>> `shutdown -h now' from a tty). During the bisection I tested each
>> kernel build by running startx, emacs, firefox, killing these, then
>> `shutdown -h now'. I've attached the git log of the bisection augmented
>> with power-off timing notes. (FWIW, when I reboot the machine with
>> `shutdown -r now' there is no delay with the problematic kernels.)
>
> So reboot is not affected, just the halt.

Yes.

> Could you please check if the
> "poweroff" variant is also affected?

I have tried `poweroff' and also `shutdown -hP now' and these make no
difference: with kernel 5.1.0 and earlier, the machine powers off
promptly, with 5.2.0 and later it doesn't.

> â
>> Another pointer I got was to the recent commit 62849a96, which fixes a
>> bug due to the commit that causes my problem. I applied 62849a96 to
>> 5.6.4, but it did not prevent the delay/hang.
>
> You should see a warning if you were facing the problem described in the
> commit.

I didn't see any warning, so that doesn't seem related to my issue.

> â
>> Please let me know if there's any
>> other information I can provide or anything else I can do to help with
>> debugging this issue. I hope you can fix it.
>
> Can you send dmesg output of the system / lspci -k? I'm mainly
> interested what drivers are bound to storage devices (you can send it
> offlist if you want).

I've attached them (I included dmesg both from the "good" kernel 5.1.0
and the "bad" kernel 5.6.4; they differ somewhat but I don't know if the
differences are significant).

> Can you log the output on the serial console?

How do I do that?

> If the commit you cited is really the problem then it would mean that a
> worker isn't scheduled for some reason. Could you please enable
> CONFIG_WQ_WATCHDOG to see if workqueue core code notices that a worker
> isn't making progress?

How will I know if that happens, is there a specific message in the tty?

Thanks for your reply.

Steve Berman

Attachment: dmesg-5.1.0
Description: Binary data

Attachment: dmesg-5.6.4
Description: Binary data

Attachment: lspci-k-5.6.4
Description: Binary data