Re: default_enter_idle leaked IRQ state

From: Forza
Date: Wed Aug 02 2023 - 07:41:17 EST




On 2023-07-07 12:13, Forza wrote:


On 2023-07-06 12:54, Linux regression tracking (Thorsten Leemhuis) wrote:
On 06.07.23 10:08, Forza wrote:
On Wed, May 24, 2023 at 11:13:57AM +0200, David Sterba wrote:
[...]
A small update.

Thx for this.

I have been able test 6.2.16, all 6.3.x and 6.4.1 and they all show
the same issue.

I am now trying 6.1.37 since two days and have not been able to
reproduce this issue on any of my virtual qemu/kvm machines. Perhaps
this information is helpful in finding the root cause?

That means it's most likely a regression between v6.1..v6.2 (or
v6.1..v6.2.16 if we are unlucky) somewhere (from earlier in the thread
it sounds like it might not be Btrfs).

Agreed, I do not think this specific bug (cpuidle /  default_enter_idle leaked IRQ state) is Btrfs related. Some of the virtual machines I test on do not use Btrfs.

Which makes we wonder: how long do you usually need to reproduce the
issue? If it's not too long it might mean that a bisection is the best
way forward, unless some developer sits down and looks closely at the
logs. With a bit of luck some dev will do that; but if we are unlucky we
likely will need a bisection.


It has varied. Sometimes immediately upon boot, but can take several hours or a day before showing up.


Also, I forgot to say I was basing my kernels on gentoo-kernels, which has some patches against vanilla. Therefore I will I will compile a set of vanilla kernels from 6.1.37 until 6.4.2 and run them in my testing machines to see where the problem is happening.

This is not a fast system, so it will likely take several days. But I will keep you posted.

Meanwhile, if you think of any specific kernel debug options, tracing, etc, that I should enable, let me know

Should we change the Subject line of this email thread?

Have been running for a few weeks now. The earliest vanilla-kernel I have ben able to reproduce the issue with is 6.2.16, it only happened once over two weeks.

The issue still happens frequently on 6.4.x kernels. Here is the output from 6.4.7, and as you see happened only a few seconds after booting. This VM has only a minimal system installed:

# pstree
init-+-acpid
|-crond
|-dropbear---dropbear---bash---pstree
|-2*[getty]
|-qemu-ga---{qemu-ga}
|-supervise-daemo---tgtd---16*[{tgtd}]
|-syslogd
|-udhcpc
`-watchdog

# dmesg

[ 4.328170][ T1] Run /sbin/init as init process
[ 4.329279][ T1] with arguments:
[ 4.329282][ T1] /sbin/init
[ 4.329284][ T1] with environment:
[ 4.329286][ T1] HOME=/
[ 4.329288][ T1] TERM=linux
[ 11.540637][ T1992] Adding 1048572k swap on /dev/vda. Priority:-2 extents:1 across:1048572k DscFS
[ 12.917894][ T0] ------------[ cut here ]------------
[ 12.917907][ T0] default_enter_idle leaked IRQ state
[ 12.918001][ T0] WARNING: CPU: 2 PID: 0 at drivers/cpuidle/cpuidle.c:269 cpuidle_enter_state+0x35f/0x510
[ 12.918052][ T0] Modules linked in:
[ 12.918064][ T0] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.4.7-ksan-ubsan #1
[ 12.918070][ T0] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[ 12.918074][ T0] RIP: 0010:cpuidle_enter_state+0x35f/0x510
[ 12.918088][ T0] Code: 89 c2 83 e2 01 3c 01 0f 87 b1 01 00 00 84 d2 75 1d c6 05 ac bf 1f 02 01 90 49 8b 76 50 48 c7 c7 60 23 b1 ba e8 82 ef ed fc 90 <0f> 0b 90 90 fa e9 36 fd ff ff 83 c2 01 48 83 c1 40 41 39 d0 0f 8e
[ 12.918093][ T0] RSP: 0018:ffffc9000015fd98 EFLAGS: 00010246
[ 12.918112][ T0] RAX: 0000000000000000 RBX: ffffe8ffffd08c00 RCX: 0000000000000000
[ 12.918116][ T0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 12.918119][ T0] RBP: ffffffffbc39ae40 R08: 0000000000000000 R09: 0000000000000000
[ 12.918123][ T0] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 12.918125][ T0] R13: 0000000000000001 R14: ffffffffbc39aec0 R15: 0000000000000000
[ 12.918130][ T0] FS: 0000000000000000(0000) GS:ffff88806d100000(0000) knlGS:0000000000000000
[ 12.918146][ T0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.918150][ T0] CR2: 00007f8dbf5bc000 CR3: 000000000f880000 CR4: 00000000003506e0
[ 12.918154][ T0] Call Trace:
[ 12.918156][ T0] <TASK>
[ 12.918160][ T0] ? __warn+0xc9/0x240
[ 12.918198][ T0] ? cpuidle_enter_state+0x35f/0x510
[ 12.918203][ T0] ? report_bug+0x261/0x2c0
[ 12.918219][ T0] ? __pfx_default_enter_idle+0x10/0x10
[ 12.918226][ T0] ? handle_bug+0x3d/0x80
[ 12.918234][ T0] ? exc_invalid_op+0x13/0x50
[ 12.918238][ T0] ? asm_exc_invalid_op+0x16/0x20
[ 12.918251][ T0] ? cpuidle_enter_state+0x35f/0x510
[ 12.918255][ T0] ? cpuidle_enter_state+0x35e/0x510
[ 12.918269][ T0] ? tick_nohz_stop_tick+0x22c/0x660
[ 12.918288][ T0] cpuidle_enter+0x4a/0xb0
[ 12.918294][ T0] do_idle+0x31c/0x400
[ 12.918304][ T0] ? __pfx_do_idle+0x10/0x10
[ 12.918308][ T0] cpu_startup_entry+0x19/0x20
[ 12.918311][ T0] start_secondary+0x241/0x2d0
[ 12.918323][ T0] ? __pfx_start_secondary+0x10/0x10
[ 12.918328][ T0] secondary_startup_64_no_verify+0xf4/0xfb
[ 12.918338][ T0] </TASK>
[ 12.918339][ T0] ---[ end trace 0000000000000000 ]---


kernel-6.4.7-config: https://paste.tnonline.net/files/A1IHEUjgCh8Y_config-6.4.7.txt
kernel-6.4.7-dmesg: https://paste.tnonline.net/files/YP11KEMpKXAI_dmesg-6.4.7.txt

~Forza

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.