Re: [regression] 6.8.1: fails to hibernate with pm_runtime_force_suspend+0x0/0x120 returns -16

From: Linux regression tracking (Thorsten Leemhuis)
Date: Wed Apr 03 2024 - 00:49:33 EST


On 02.04.24 21:42, Martin Steigerwald wrote:
> Linux regression tracking (Thorsten Leemhuis) - 19.03.24, 09:40:06 CEST:
>> On 16.03.24 17:12, Martin Steigerwald wrote:
>>> Martin Steigerwald - 16.03.24, 17:02:44 CET:
>>>> ThinkPad T14 AMD Gen 1 fails to hibernate with self-compiled 6.8.1.
>>>> Hibernation works correctly with self-compiled 6.7.9.
>>>
>>> Apparently 6.8.1 does not even reboot correctly anymore. runit on
>>> Devuan. It says it is doing the system reboot but then nothing
>>> happens.
>>>
>>> As for hibernation the kernel cancels the attempt and returns back to
>>> user space desktop session.
>>>
>>>> Trying to use "no_console_suspend" to debug next. Will not do bisect
>>>> between major kernel releases on a production machine.
>>
>> FWIW, without a bisection I guess no developer will take a closer look
>> (but I might be wrong and you lucky here!), as any change in those
>> hundreds of drivers used on that machine can possibly lead to problems
>> like yours. So without a bisection we are likely stuck here, unless
>> someone else runs into the same problem and bisects or fixes it. Sorry,
>> but that's just how it is.
>
> I have been asked this repeatedly with previous bug reports. My issue
> with bisecting between major kernel versions is this:
>
> When I look around here I see no second ThinkPad T14 AMD Gen 1 that I
> could use for testing. Also doing a kernel bisect using a GRML live iso…
> not really.
>
> The one I reported this from is a production machine with a 4 TB NVMe
> SSD which contains a lot of data. I am not willing to risk data loss or
> (silent) file system corruption by bisecting between major kernel
> releases. Bisecting between major kernel releases in my understanding
> would require testing various kernels between, in this example, 6.7 and
> 6.8, and even between 6.7 and 6.8-rc1. At least in my understanding,
> anything between 6.7 and 6.8-rc1 is not guaranteed to be even somewhat
> stable.

It's hard to qualify and always a matter of personal viewpoint/opinion,
but I'd say: kernels from the merge window are pretty stable and
reliable. But sure, accidents that eat data happen and they happen
slightly more often during merge windows because the rate of change is
higher. But in the end they do not happen often, which is why Fedora
rawhide for example ships merge window kernels all the time.

> I
> am not usually installing an rc1 kernel on a production machine, but
> rather wait for at least rc2/3 nowadays. It's a balanced risk calculation.
> And rc2/3 or later appears to be a risk I am willing to take. But
> something between stable and rc1? Nope.

Well, that's up to you -- but the reality is also that developers are
not obliged to look closely into regression reports unless someone
bisected them.

> It is not even that rare. Some 6.7 rc failed with hibernation as well.

Maybe too few people (or too few of those that run the latest kernels)
use hibernate these days (I haven't for more than 15 years), which is
why it's not tested much.

> With exactly the same machine. I refused to do a bisect as well in that
> case. At some later time the issue was fixed without me doing anything
> more.

Maybe you were lucky, maybe someone else bisected and reported the problem.

> Now my question is this: Without me willing to bisect in that case, is
> a bug report even useful? Otherwise I may just switch this last machine
> to distribution kernels. It would save a lot of time for me. This private
> and freelancer production machine is the last left-over machine with self-
> compiled kernels.
>
> So far I still thought I would somehow be contributing to Linux kernel
> quality with detailed bug reports that take time to write, but apparently
> I am not. Can you clarify?

Not really, as it always depends on the situation. There are bugs (like
https://lore.kernel.org/all/08275279-7462-4f4a-a0ee-8aa015f829bc@xxxxxxxxxxxxx/)
where a report without a bisection is enough. But there are others
where it's unlikely that anyone will take a closer look; a lot of those
regarding suspend/hibernate fall into this category, as problems in that
area can be caused by any subsystem and its drivers -- which is why the
power management people can't look into most of those: they would
quickly get nothing else done, spending their time on bugs that were
mostly caused by other people.

Ciao, Thorsten