mutex_lock issues during poweroff
From: Maxime Ripard
Date: Thu Sep 07 2017 - 08:16:39 EST
Hi,
We've been investigating a bug in our kernel for the last couple of
months.
The scenario is this: we have an ARM board that embeds an Allwinner A33
SoC. That board is using a PMIC connected to the SoC through a
proprietary bus, whose driver is in drivers/bus/sunxi-rsb.c. The
poweroff is implemented by sending a shutdown command to that PMIC.
http://elixir.free-electrons.com/linux/v4.9.47/source/drivers/mfd/axp20x.c#L743
That PMIC also serves other purposes, such as controlling the
regulators. We also use it to read the state of the various power
supplies, and report it through our power supply drivers.
http://elixir.free-electrons.com/linux/v4.9.47/source/drivers/power/supply/axp20x_usb_power.c
http://elixir.free-electrons.com/linux/v4.12.11/source/drivers/power/supply/axp20x_ac_power.c
http://elixir.free-electrons.com/linux/v4.12.11/source/drivers/power/supply/axp20x_battery.c
The bug arises when we have those drivers enabled on a 4.9.47 kernel
(or any 4.9 kernel; 4.8 also shows it). In some cases (1 out of
200-300 poweroffs), the board will not power off. After digging
through this, it turns out that in that scenario, the mutex_lock we
have in the bus driver never returns.
Here: http://elixir.free-electrons.com/linux/v4.9.47/source/drivers/bus/sunxi-rsb.c#L379
Which means that we never actually send the command, which also
explains why the board stays powered on.
This gets weirder, since if we dump the return value of mutex_is_locked
right before a failing case, the mutex is not locked, so we should not
block or sleep at all.
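
For anyone who wants to experiment with this kind of instrumentation
outside the kernel, here is a minimal userspace sketch, assuming POSIX
threads as a stand-in for the kernel mutex API. It is not the
sunxi-rsb code: bus_lock, bus_lock_is_locked(), bus_lock_timed() and
bus_transfer_instrumented() are hypothetical names made up for the
illustration. It takes a snapshot of the lock state and then acquires
the lock with a timeout, so that a stall is reported instead of hanging
forever. Keep in mind that the snapshot only describes the instant it
was taken; another context may take the lock between the check and the
acquisition.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Userspace stand-in for the bus lock; NOT the kernel's struct mutex. */
static pthread_mutex_t bus_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helper: snapshot of whether the lock is held right now. */
static bool bus_lock_is_locked(pthread_mutex_t *m)
{
    if (pthread_mutex_trylock(m) == 0) {
        /* We got it, so it was free; release it again immediately. */
        pthread_mutex_unlock(m);
        return false;
    }
    return true; /* trylock failed: the lock is currently held */
}

/*
 * Hypothetical helper: acquire the lock, but give up after timeout_ms.
 * Returns 0 on success, or an errno value such as ETIMEDOUT.
 */
static int bus_lock_timed(pthread_mutex_t *m, long timeout_ms)
{
    struct timespec deadline;

    /* pthread_mutex_timedlock() expects an absolute CLOCK_REALTIME time. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }
    return pthread_mutex_timedlock(m, &deadline);
}

/* Sketch of an instrumented transfer path: 0 on success, errno on stall. */
static int bus_transfer_instrumented(void)
{
    bool was_locked = bus_lock_is_locked(&bus_lock);
    int ret = bus_lock_timed(&bus_lock, 100);

    if (ret != 0) {
        fprintf(stderr, "lock stalled (was_locked=%d, err=%d)\n",
                was_locked, ret);
        return ret;
    }
    /* The command to the PMIC would be sent here. */
    pthread_mutex_unlock(&bus_lock);
    return 0;
}
```

Build with -pthread. Logging was_locked next to the timeout result
shows whether a stall happened even though the snapshot reported the
lock as free.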
If we disable the power supply drivers that poll the PMIC status on a
regular basis, poweroff works. However, we've never actually seen a
concurrent usage of that bus. In our practical cases, the mutex is
always unlocked.
If we remove the mutex_lock / _unlock entirely, we don't stall anymore
either, which seems to confirm something weird going on here.
One thing worth noting is that we couldn't reproduce the issue with
4.13. We can't easily bisect due to the number of patches that we
still carry on 4.9 and that have all been merged since, but it seems
like the bug was fixed (either on purpose or as a side effect), and was
never sent to stable. Looking at the history of kernel/locking/mutex.c
during that window didn't really show anything obvious though.
If you have any ideas or spot something very wrong, I'd be happy to
hear about it. Thanks!
Maxime
--
Maxime Ripard, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com