Re: [PATCH 4.13 20/27] Revert "firmware: add sanity check on shutdown/suspend"

From: Luis R. Rodriguez
Date: Tue Sep 12 2017 - 13:20:42 EST


On Tue, Sep 12, 2017 at 10:00:00AM -0700, Greg Kroah-Hartman wrote:
> 4.13-stable review patch. If anyone has any objections, please let me know.
>
> ------------------
>
> From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>
> commit f007cad159e99fa2acd3b2e9364fbb32ad28b971 upstream.
>
> This reverts commit 81f95076281fdd3bc382e004ba1bce8e82fccbce.

I'm not convinced reverting this commit is the right thing to do at
this point given it would seem other errors were happening prior to
v4.13 and this commit would rather just bring to light the core of
the issue which needs to be addressed.

If reverting this commit please consider reverting also commit
06a45a93e7d34a ("firmware: move umh try locks into the umh code").

> It causes random failures of firmware loading at resume time (well,
> random for me, it seems to be more reliable for others)

That randomness is expected, as well as the old issues without this
commit and commit 06a45a93e7d34a. By removing this commit though we
should also be introducing another vector of possible issues though,
and I'm afraid that it may be hard to predict when they were caused
by a race issue on resume.

> because the
> firmware disabling is not actually synchronous with any particular
> resume event, and at least the btusb driver that uses a workqueue to
> load the firmware at resume seems to occasionally hit the "firmware
> loading is disabled" logic because the firmware loader hasn't gotten the
> resume event yet.

It would seem that without these commits the issue was present also:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1356076

These commits just bring forward the issues closer to attention. They
have my attention now and I am looking at a clean way to address it
with Marcel.

> Some kind of sanity check for not trying to load firmware when it's not
> possible might be a good thing, but this commit was not it.
>
> Greg seems to have silently suffered the same issue, and pointed to the
> likely culprit, and Gabriel C verified the revert fixed it for him too.

If testing against old behaviour the right way would be to also revert
commit 06a45a93e7d34a ("firmware: move umh try locks into the umh code").

Luis

>
> Reported-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Pointed-at-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> Tested-by: Gabriel C <nix.or.die@xxxxxxxxx>
> Cc: Luis R. Rodriguez <mcgrof@xxxxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
>
> ---
> Documentation/driver-api/firmware/request_firmware.rst | 11 -
> drivers/base/firmware_class.c | 99 -----------------
> 2 files changed, 110 deletions(-)
>
> --- a/Documentation/driver-api/firmware/request_firmware.rst
> +++ b/Documentation/driver-api/firmware/request_firmware.rst
> @@ -44,17 +44,6 @@ request_firmware_nowait
> .. kernel-doc:: drivers/base/firmware_class.c
> :functions: request_firmware_nowait
>
> -Considerations for suspend and resume
> -=====================================
> -
> -During suspend and resume only the built-in firmware and the firmware cache
> -elements of the firmware API can be used. This is managed by fw_pm_notify().
> -
> -fw_pm_notify
> -------------
> -.. kernel-doc:: drivers/base/firmware_class.c
> - :functions: fw_pm_notify
> -
> request firmware API expected driver use
> ========================================
>
> --- a/drivers/base/firmware_class.c
> +++ b/drivers/base/firmware_class.c
> @@ -256,38 +256,6 @@ static int fw_cache_piggyback_on_request
> * guarding for corner cases a global lock should be OK */
> static DEFINE_MUTEX(fw_lock);
>
> -static bool __enable_firmware = false;
> -
> -static void enable_firmware(void)
> -{
> - mutex_lock(&fw_lock);
> - __enable_firmware = true;
> - mutex_unlock(&fw_lock);
> -}
> -
> -static void disable_firmware(void)
> -{
> - mutex_lock(&fw_lock);
> - __enable_firmware = false;
> - mutex_unlock(&fw_lock);
> -}
> -
> -/*
> - * When disabled only the built-in firmware and the firmware cache will be
> - * used to look for firmware.
> - */
> -static bool firmware_enabled(void)
> -{
> - bool enabled = false;
> -
> - mutex_lock(&fw_lock);
> - if (__enable_firmware)
> - enabled = true;
> - mutex_unlock(&fw_lock);
> -
> - return enabled;
> -}
> -
> static struct firmware_cache fw_cache;
>
> static struct firmware_buf *__allocate_fw_buf(const char *fw_name,
> @@ -1239,12 +1207,6 @@ _request_firmware(const struct firmware
> if (ret <= 0) /* error or already assigned */
> goto out;
>
> - if (!firmware_enabled()) {
> - WARN(1, "firmware request while host is not available\n");
> - ret = -EHOSTDOWN;
> - goto out;
> - }
> -
> ret = fw_get_filesystem_firmware(device, fw->priv);
> if (ret) {
> if (!(opt_flags & FW_OPT_NO_WARN))
> @@ -1755,62 +1717,6 @@ static void device_uncache_fw_images_del
> msecs_to_jiffies(delay));
> }
>
> -/**
> - * fw_pm_notify - notifier for suspend/resume
> - * @notify_block: unused
> - * @mode: mode we are switching to
> - * @unused: unused
> - *
> - * Used to modify the firmware_class state as we move in between states.
> - * The firmware_class implements a firmware cache to enable device driver
> - * to fetch firmware upon resume before the root filesystem is ready. We
> - * disable API calls which do not use the built-in firmware or the firmware
> - * cache when we know these calls will not work.
> - *
> - * The inner logic behind all this is a bit complex so it is worth summarizing
> - * the kernel's own suspend/resume process with context and focus on how this
> - * can impact the firmware API.
> - *
> - * First a review on how we go to suspend::
> - *
> - * pm_suspend() --> enter_state() -->
> - * sys_sync()
> - * suspend_prepare() -->
> - * __pm_notifier_call_chain(PM_SUSPEND_PREPARE, ...);
> - * suspend_freeze_processes() -->
> - * freeze_processes() -->
> - * __usermodehelper_set_disable_depth(UMH_DISABLED);
> - * freeze all tasks ...
> - * freeze_kernel_threads()
> - * suspend_devices_and_enter() -->
> - * dpm_suspend_start() -->
> - * dpm_prepare()
> - * dpm_suspend()
> - * suspend_enter() -->
> - * platform_suspend_prepare()
> - * dpm_suspend_late()
> - * freeze_enter()
> - * syscore_suspend()
> - *
> - * When we resume we bail out of a loop from suspend_devices_and_enter() and
> - * unwind back out to the caller enter_state() where we were before as follows::
> - *
> - * enter_state() -->
> - * suspend_devices_and_enter() --> (bail from loop)
> - * dpm_resume_end() -->
> - * dpm_resume()
> - * dpm_complete()
> - * suspend_finish() -->
> - * suspend_thaw_processes() -->
> - * thaw_processes() -->
> - * __usermodehelper_set_disable_depth(UMH_FREEZING);
> - * thaw_workqueues();
> - * thaw all processes ...
> - * usermodehelper_enable();
> - * pm_notifier_call_chain(PM_POST_SUSPEND);
> - *
> - * fw_pm_notify() works through pm_notifier_call_chain().
> - */
> static int fw_pm_notify(struct notifier_block *notify_block,
> unsigned long mode, void *unused)
> {
> @@ -1824,7 +1730,6 @@ static int fw_pm_notify(struct notifier_
> */
> kill_pending_fw_fallback_reqs(true);
> device_cache_fw_images();
> - disable_firmware();
> break;
>
> case PM_POST_SUSPEND:
> @@ -1837,7 +1742,6 @@ static int fw_pm_notify(struct notifier_
> mutex_lock(&fw_lock);
> fw_cache.state = FW_LOADER_NO_CACHE;
> mutex_unlock(&fw_lock);
> - enable_firmware();
>
> device_uncache_fw_images_delay(10 * MSEC_PER_SEC);
> break;
> @@ -1886,7 +1790,6 @@ static void __init fw_cache_init(void)
> static int fw_shutdown_notify(struct notifier_block *unused1,
> unsigned long unused2, void *unused3)
> {
> - disable_firmware();
> /*
> * Kill all pending fallback requests to avoid both stalling shutdown,
> * and avoid a deadlock with the usermode_lock.
> @@ -1902,7 +1805,6 @@ static struct notifier_block fw_shutdown
>
> static int __init firmware_class_init(void)
> {
> - enable_firmware();
> fw_cache_init();
> register_reboot_notifier(&fw_shutdown_nb);
> #ifdef CONFIG_FW_LOADER_USER_HELPER
> @@ -1914,7 +1816,6 @@ static int __init firmware_class_init(vo
>
> static void __exit firmware_class_exit(void)
> {
> - disable_firmware();
> #ifdef CONFIG_PM_SLEEP
> unregister_syscore_ops(&fw_syscore_ops);
> unregister_pm_notifier(&fw_cache.pm_notify);
>
>
>

--
Luis Rodriguez, SUSE LINUX GmbH
Maxfeldstrasse 5; D-90409 Nuernberg