Re: Linux 6.3-rc2

From: Guenter Roeck
Date: Mon Mar 13 2023 - 16:30:57 EST


On Mon, Mar 13, 2023 at 11:21:44AM -0700, Linus Torvalds wrote:
> On Mon, Mar 13, 2023 at 8:53 AM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
> >
> > Warning backtraces in calls from ct_nmi_enter(),
> > seen randomly.
>
> Hmm.
>
> I suspect this one is a bug in the warning, not in the kernel,
> although I have no idea why it would have started happening now.
>
> This happens from an irq event, but that check is not *supposed* to
> happen at all from interrupts:
>
> * We dont accurately track softirq state in e.g.
> * hardirq contexts (such as on 4KSTACKS), so only
> * check if not in hardirq contexts:
>
> but I think that the ct_nmi_enter() function was called before the
> hardirq count had even been incremented.
>
> > Sample decoded stack trace:
>
> Hmm. That WARNING backtrace doesn't actually seem to follow the stack
> chain, so it only shows the irq stack, not where the irq happened.
>
> > Seen if CONFIG_DEBUG_LOCK_ALLOC=y and CONFIG_CONTEXT_TRACKING_IDLE=y.
> > It seems that rcu_read_lock_sched_held() can be true when entering an interrupt.
> >
> > The problem is not seen in v6.2, but occurs randomly on ToT with various
> > arm emulations.
>
> Strange. I must be wrong about this being a race on the warning
> itself, because that warning has been there for a long long time.
>
> Adding in some people who might have more of a clue. I'm thinking
> Frederic and Paul might know what's up with the context tracking, but
> I don't see why this would be arm-related or have started recently.
> But I do note that PeterZ did some rcuidle tracing cleanups that do
> end up affecting arm too.
>
> So adding PeterZ too.
>
> Original email with full details at
>
> https://lore.kernel.org/lkml/d915df60-d06b-47d4-8b47-8aa1bbc2aac7@xxxxxxxxxxxx/
>
> for added peeps.
>
> Anybody?
>

It gets weird. Bisect log below. Reverting the identified patch does
indeed seem to fix the problem, but I have no clue why that would be
the case. The patch looks completely innocent to me. Yet I can
reliably reproduce the problem with v6.3-rc2, and so far I have not
been able to reproduce it with commit f3dd0c53370 reverted (and I am
trying on five different servers in parallel).

Guenter

---
# bad: [a5c95ca18a98d742d0a4a04063c32556b5b66378] Merge tag 'drm-next-2023-02-23' of git://anongit.freedesktop.org/drm/drm
# good: [c9c3395d5e3dcc6daee66c6908354d47bf98cb0c] Linux 6.2
git bisect start 'a5c95ca18a98' 'v6.2'
# good: [36289a03bcd3aabdf66de75cb6d1b4ee15726438] Merge tag 'v6.3-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect good 36289a03bcd3aabdf66de75cb6d1b4ee15726438
# bad: [0175ec3a28c695562a08fdccf73f2ec5ed744e2f] Merge tag 'regulator-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
git bisect bad 0175ec3a28c695562a08fdccf73f2ec5ed744e2f
# good: [cb6b2e11a42decea2afc77df73ec7326db1ac25f] devlink: Fix memleak in health diagnose callback
git bisect good cb6b2e11a42decea2afc77df73ec7326db1ac25f
# good: [3365777a6a2243f1cca5a441f2c89002d16fc580] net: phy: marvell: Use the unlocked genphy_c45_ethtool_get_eee()
git bisect good 3365777a6a2243f1cca5a441f2c89002d16fc580
# good: [700ed3bbb7a0bd5eeb805a2c2ba47a6d7b286745] ASoC: SOF: core/ipc4/mtl: Add support for PCM delay
git bisect good 700ed3bbb7a0bd5eeb805a2c2ba47a6d7b286745
# good: [4d4266e3fd321fadb628ce02de641b129522c39c] page_pool: add a comment explaining the fragment counter usage
git bisect good 4d4266e3fd321fadb628ce02de641b129522c39c
# good: [76f5aaabce492aa6991c28c96bb78b00b05d06c5] ASoC: soc-ac97: Return correct error codes
git bisect good 76f5aaabce492aa6991c28c96bb78b00b05d06c5
# good: [5661706efa200252d0e9fea02421b0a5857808c3] Merge branch 'topic/apple-gmux' into for-next
git bisect good 5661706efa200252d0e9fea02421b0a5857808c3
# bad: [603ac530f13506e6ce5db4ab953ede4d292c5327] Merge tag 'regmap-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
git bisect bad 603ac530f13506e6ce5db4ab953ede4d292c5327
# good: [b60417a9f2b890a8094477b2204d4f73c535725e] selftest: fib_tests: Always cleanup before exit
git bisect good b60417a9f2b890a8094477b2204d4f73c535725e
# bad: [064d7dcf51a82b480e953a15cca47e5df0426502] Merge tag 'sound-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect bad 064d7dcf51a82b480e953a15cca47e5df0426502
# good: [5b7c4cabbb65f5c469464da6c5f614cbd7f730f2] Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect good 5b7c4cabbb65f5c469464da6c5f614cbd7f730f2
# good: [7933b90b42896f5b6596e6a829bb31c5121fc2a9] Merge branch 'for-linus' into for-next
git bisect good 7933b90b42896f5b6596e6a829bb31c5121fc2a9
# bad: [f3dd0c53370e70c0f9b7e931bbec12916f3bb8cc] bpf: add missing header file include
git bisect bad f3dd0c53370e70c0f9b7e931bbec12916f3bb8cc
# first bad commit: [f3dd0c53370e70c0f9b7e931bbec12916f3bb8cc] bpf: add missing header file include
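
A log in this format can be summarized mechanically. The following is a minimal sketch, assuming the text was saved from `git bisect log`; the function name `summarize_bisect_log` and the shortened `SAMPLE` excerpt are hypothetical and only for illustration. It collects the good/bad verdicts from the comment lines and picks out the "first bad commit" entry.

```python
import re

# Matches comment lines such as "# good: [<sha>] <subject>" as they
# appear in the log above. The 40-hex-digit SHA is an assumption that
# holds for this log (SHA-1 repositories).
VERDICT_RE = re.compile(
    r"^# (good|bad|first bad commit): \[([0-9a-f]{40})\] (.*)$"
)


def summarize_bisect_log(text):
    """Return (steps, first_bad) parsed from bisect log text.

    steps is a list of (verdict, sha, subject) tuples in log order;
    first_bad is a (sha, subject) tuple, or None if the log does not
    contain a "first bad commit" line.
    """
    steps = []
    first_bad = None
    for line in text.splitlines():
        match = VERDICT_RE.match(line.strip())
        if not match:
            continue  # skip the "git bisect ..." command lines
        verdict, sha, subject = match.groups()
        if verdict == "first bad commit":
            first_bad = (sha, subject)
        else:
            steps.append((verdict, sha, subject))
    return steps, first_bad


# Shortened excerpt of the log above (hypothetical selection of lines).
SAMPLE = """\
# bad: [a5c95ca18a98d742d0a4a04063c32556b5b66378] Merge tag 'drm-next-2023-02-23' of git://anongit.freedesktop.org/drm/drm
# good: [c9c3395d5e3dcc6daee66c6908354d47bf98cb0c] Linux 6.2
git bisect start 'a5c95ca18a98' 'v6.2'
# bad: [f3dd0c53370e70c0f9b7e931bbec12916f3bb8cc] bpf: add missing header file include
git bisect bad f3dd0c53370e70c0f9b7e931bbec12916f3bb8cc
# first bad commit: [f3dd0c53370e70c0f9b7e931bbec12916f3bb8cc] bpf: add missing header file include
"""

if __name__ == "__main__":
    steps, first_bad = summarize_bisect_log(SAMPLE)
    for verdict, sha, subject in steps:
        print(f"{verdict:>4} {sha[:12]} {subject}")
    if first_bad is not None:
        print("first bad:", first_bad[0][:12], first_bad[1])
```

The same saved log can also be fed back to git with `git bisect replay <logfile>` to repeat the bisection state on another tree.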