[BUGFIX][PATCH] Freezer, CPU hotplug, x86 Microcode: Fix task freezing failures

From: Srivatsa S. Bhat
Date: Sun Oct 02 2011 - 15:05:22 EST

This patch addresses the warnings found in the logs of the
task freezing failure bug reported at https://lkml.org/lkml/2011/9/5/28

The warnings appear for the following reason:

There are microcode callbacks registered for CPU hotplug events such
as a CPU getting offlined or onlined. When a CPU is offlined
with tasks being frozen (as in the case of disabling the non-boot CPUs
while preparing for a system suspend operation), the CPU_DEAD_FROZEN
notification is sent, for which the microcode callback does not
do anything. In particular, it does not free or invalidate the CPU
microcode that it obtained from userspace earlier. Hence when that CPU
comes back online with tasks being frozen (as in the case of re-enabling
the non-boot CPUs during a resume operation after suspend), the microcode
callback applies the microcode (which it already possesses) to that CPU.

However, during a pure CPU hotplug operation, tasks are not frozen and
hence the CPU_DEAD notification is sent. Upon this event notification,
the microcode callback frees the copy of microcode it has and
invalidates it. And during a CPU online, it tries to apply the microcode
to the CPU, but since it doesn't have the copy of the microcode, it depends
on a userspace utility to get the microcode. This is perfectly fine when
doing plain CPU hotplug operations alone.

Things go wrong when a CPU hotplug stress test is carried out along with
a suspend/resume operation running simultaneously. Upon getting a CPU_DEAD
notification (for example, when a CPU offline occurs with tasks not frozen),
the microcode callback frees up the microcode and invalidates it. Later
when that CPU gets onlined with tasks being frozen, the microcode callback
(for the CPU_ONLINE_FROZEN event) tries to apply the microcode to the CPU,
does not find it, and hence depends on the (currently frozen) userspace to
get the microcode again. This leads to the numerous "WARNING"s at
drivers/base/firmware_class.c, which eventually lead to task freezing failures
in the suspend code path, as has been reported.

So, this patch addresses this issue by ensuring that microcode is not freed
from kernel memory, nor invalidated when a CPU goes offline. Thus once the
kernel gets the microcode during boot-up, it will never have to depend on
userspace ever again to get microcode, since it never releases the copy it
already has. So every run of the microcode callback for a CPU online event will
now succeed irrespective of whether userspace is frozen or not. As a result,
this fixes the task freezing failure encountered while running a CPU hotplug
stress test along with suspend/resume operations simultaneously.
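
The event ordering described above can be illustrated with a small userspace
model. This is only a sketch and not kernel code: the enum values, struct
cpu_model, handle_event(), request_from_userspace() and run_mixed_sequence()
are hypothetical names invented for illustration, and the free_on_dead flag
simply selects between the old behaviour (invalidate on CPU_DEAD) and the
behaviour after this patch (keep the copy).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the CPU hotplug notifications. */
enum hp_event {
	EV_CPU_DEAD,
	EV_CPU_DEAD_FROZEN,
	EV_CPU_ONLINE,
	EV_CPU_ONLINE_FROZEN,
};

/* Illustrative per-CPU state; not the kernel's real data structure. */
struct cpu_model {
	bool have_microcode;	/* kernel still holds a valid copy */
	bool userspace_frozen;	/* tasks are frozen by the freezer */
	int failed_requests;	/* requests issued while userspace was frozen */
};

/* Model of fetching microcode via userspace; fails while tasks are frozen. */
static bool request_from_userspace(struct cpu_model *m)
{
	if (m->userspace_frozen) {
		m->failed_requests++;
		return false;
	}
	m->have_microcode = true;
	return true;
}

/*
 * Model of the hotplug callback. free_on_dead == true mimics the old
 * behaviour; false mimics the behaviour after this patch.
 * Returns true if the event was handled without a failed request.
 */
static bool handle_event(struct cpu_model *m, enum hp_event ev,
			 bool free_on_dead)
{
	assert(m != NULL);

	switch (ev) {
	case EV_CPU_DEAD:
		if (free_on_dead)
			m->have_microcode = false;
		return true;
	case EV_CPU_DEAD_FROZEN:
		/* Nothing is freed on this path in either variant. */
		return true;
	case EV_CPU_ONLINE:
	case EV_CPU_ONLINE_FROZEN:
		m->userspace_frozen = (ev == EV_CPU_ONLINE_FROZEN);
		if (m->have_microcode)
			return true;	/* apply the cached copy */
		return request_from_userspace(m);
	}
	return false;
}

/* Offline with tasks running, then online with tasks frozen. */
static int run_mixed_sequence(bool free_on_dead)
{
	struct cpu_model m = {
		.have_microcode = true,
		.userspace_frozen = false,
		.failed_requests = 0,
	};

	handle_event(&m, EV_CPU_DEAD, free_on_dead);
	handle_event(&m, EV_CPU_ONLINE_FROZEN, free_on_dead);
	return m.failed_requests;
}
```

In this model, run_mixed_sequence(true) records one failed request, which
corresponds to the warning path, while run_mixed_sequence(false) records none
because the cached copy is still available when the CPU comes back online.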

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>

arch/x86/kernel/microcode_core.c | 10 +++++++++-
1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/microcode_core.c b/arch/x86/kernel/microcode_core.c
index f924280..cd7ef2f 100644
--- a/arch/x86/kernel/microcode_core.c
+++ b/arch/x86/kernel/microcode_core.c
@@ -483,7 +483,15 @@ mc_cpu_callback(struct notifier_block *nb, unsigned long action, void *hcpu)
sysfs_remove_group(&sys_dev->kobj, &mc_attr_group);
pr_debug("CPU%d removed\n", cpu);
- case CPU_DEAD:
+ /*
+ * Do not invalidate the microcode if a CPU goes offline,
+ * because it would be impossible to get the microcode again
+ * from userspace when the CPU comes back up, if the userspace
+ * happens to be frozen at that moment by the freezer subsystem,
+ * for example, due to a suspend operation in progress.
+ */
/* The CPU refused to come up during a system resume */
