Hung tasks due to an AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex (was: 6.9.3 Hung tasks)
From: Linux regression tracking (Thorsten Leemhuis)
Date: Fri May 31 2024 - 04:40:08 EST
[adding the LED folks and the regressions list to the list of recipients]
Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.
Lee, Pavel, could you look into the regression report below, please? Thread
starts here:
https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@xxxxxxxxxxxx/
Another report with somewhat similar symptoms can be found here:
https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@xxxxxxxxx/
See also Russell's analysis of that report below (many many thx for
that, much appreciated, Russell!).
To my untrained eyes all of this sounds a lot like we still have a 6.9
regression related to the LED code somewhere. Reminder, we had earlier
trouble, but that was avoided through other measures:
* 3d913719df14c2 ("wifi: iwlwifi: Use request_module_nowait") /
https://lore.kernel.org/lkml/30f757e3-73c5-5473-c1f8-328bab98fd7d@xxxxxxxxxxxxxxx/
* c04d1b9ecce565 ("igc: Fix LED-related deadlock on driver unbind") /
https://lore.kernel.org/all/ZhRD3cOtz5i-61PB@mail-itl/
* 19fa4f2a85d777 ("r8169: fix LED-related deadlock on module removal")
That iwlwifi commit even calls itself a "work around". The developer who
submitted it bisected the problem to an LED merge, but sadly that was the
end of it. :-/
Ciao, Thorsten
On 30.05.24 16:04, Russell King (Oracle) wrote:
> On Thu, May 30, 2024 at 09:36:45AM -0400, Genes Lists wrote:
>> On Thu, 2024-05-30 at 08:53 -0400, Genes Lists wrote:
>> This report for 6.9.1 could well be the same issue:
>> https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@xxxxxxxxx/
>
> The reg_check_chans_work() thing in pid 285 is likely stuck on the
> rtnl lock. The same is true of pid 287.
>
> That will be because of the thread (pid 663) that's stuck in
> __dev_open()...led_trigger_register(), where the rtnl lock will have
> been taken in that path. It looks to me like led_trigger_register()
> is stuck waiting for read access with the leds_list_lock rwsem.
>
> There are only two places that take that rwsem in write mode, which
> are led_classdev_register_ext() and led_classdev_unregister(). Neither
> of these paths is blocking in v6.9.
>
> Pid 641 doesn't look significant (it's probably waiting for either
> pid 285 or 287 to complete its work).
>
> Pid 666 looks like it is blocked waiting for exclusive write-access
> on the leds_list_lock - but it isn't holding that lock. This means
> there must already be some other reader or writer holding this lock.
>
> Pid 722 doesn't look significant (same as pid 641).
>
> Pid 760 is also waiting for the rtnl lock.
>
> Pids 854 and 855 also don't look significant (same as pid 641).
>
> And then we get to pid 858. This is in set_device_name(), which
> was called from led_trigger_set() and led_trigger_register().
> We know from pid 663 that led_trigger_register() can take a read
> on leds_list_lock, and indeed it does and then calls
> led_match_default_trigger(), which then goes on to call
> led_trigger_set(). Bingo, this is why pid 666 is blocked, which
> then blocks pid 663. pid 663 takes the rtnl lock, which blocks
> everything else _and_ also blocks pid 858 in set_device_name().
>
> Lockdep would've found this... this is a classic AB-BA deadlock
> between the leds_list_lock rwsem and the rtnl mutex.
>
> I haven't checked to see how that deadlock got introduced, that's
> for someone else to do.
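To make the lock ordering in Russell's analysis concrete, here is a
minimal userspace sketch of the kind of check lockdep performs. It is not
lockdep's implementation and not kernel code: the names lock_id, order,
reachable() and record_order() are invented for illustration, and the
model ignores the read/write distinction on the rwsem. It only records
"lock B was requested while lock A was held" and refuses an edge that
would close a cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two locks named in the analysis. */
enum lock_id { RTNL_MUTEX, LEDS_LIST_LOCK, NUM_LOCKS };

/* order[a][b] is true once some path requested b while holding a. */
static bool order[NUM_LOCKS][NUM_LOCKS];

/* Depth-first search: is `to` reachable from `from` via recorded edges? */
static bool reachable(enum lock_id from, enum lock_id to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < NUM_LOCKS; next++) {
        if (order[from][next] && !seen[next] &&
            reachable((enum lock_id)next, to, seen))
            return true;
    }
    return false;
}

/*
 * Record the ordering "held, then wanted".
 * Returns false (and records nothing) if the new edge would close a
 * cycle, i.e. if the opposite ordering is already known.
 */
static bool record_order(enum lock_id held, enum lock_id wanted)
{
    bool seen[NUM_LOCKS] = { false };

    assert(held < NUM_LOCKS && wanted < NUM_LOCKS);
    if (reachable(wanted, held, seen))
        return false;           /* lock-order inversion */
    order[held][wanted] = true;
    return true;
}
```

In this model, record_order(RTNL_MUTEX, LEDS_LIST_LOCK) stands for pid 663
(rtnl held in the __dev_open() path, then leds_list_lock wanted in
led_trigger_register()) and is accepted. A later
record_order(LEDS_LIST_LOCK, RTNL_MUTEX) stands for pid 858 (leds_list_lock
read-held in led_trigger_register(), then rtnl wanted in
set_device_name()) and is rejected as an inversion.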
P.S.:
#regzbot report: /
#regzbot introduced: f5c31bcf604d
#regzbot duplicate:
https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@xxxxxxxxx/
#regzbot summary: leds: Hung tasks due to an AB-BA deadlock between the
leds_list_lock rwsem and the rtnl mutex