Re: [PATCH] mce: fix warning messages about static struct mce_device

From: Srivatsa S. Bhat
Date: Thu Jan 19 2012 - 08:30:15 EST


On 01/19/2012 06:02 PM, Ingo Molnar wrote:

>
> * Kay Sievers <kay.sievers@xxxxxxxx> wrote:
>
>>> There's nothing special about the driver model code in this
>>> respect. The same restriction applies wherever object
>>> lifetimes are controlled by reference counting.
>>
>> Right. But the background here might not be obvious:
>>
>> An allocated device object (memory) usually represents an
>> actual device (hardware). The object can have N users. Each of
>> the users is required to take a reference to the object, which
>> pins the object's memory as long as any of the N users might
>> need to access it.
>>
>> In a hotplug world, we deal with device removal. On
>> disconnect, we usually just orphan the object: we remove it
>> from visibility and break the device <-> object relation.
>>
>> All of the N users with a reference can still access the
>> memory, they just do not talk to a real device anymore. The
>> invalidated/orphaned state is communicated otherwise, by locks
>> and flags in the device object. Only after all of the N users
>> have dropped their references is the memory of the orphan freed.
>
> But this is not what happened here - it's a special piece of
> fundamental hardware that doesn't hot-plug separately from the
> CPU and that has just a single "user".
>
> So I'm curious, why wasn't the memset() enough? It should have
> resolved the bug AFAICS.
>


It did! The memset _did_ fix the bug.

See commit a3301b7 (x86/mce: Fix CPU hotplug and suspend regression
related to MCE).

Just to clarify: the bug was that a CPU offline followed by a CPU
online would leave stale pointers in use in a device structure related
to MCE, and hence suspend-resume would not work on the second attempt
to suspend. The other (expected) symptom of this bug was that a CPU
offline + CPU online would cause the machine to oops, because it
tried to dereference one of those invalid pointers.

And the memset() fixed this bug. Completely.
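
To illustrate why clearing the structure matters, here is a minimal
userspace sketch. It is not the actual mce code: every name in it
(fake_device, fake_register, fake_unregister, cpu_online) is
hypothetical, and the "stale state" is modelled by a single pointer
that unregistration leaves behind. It only shows the general pattern
of a static object being reused across an offline/online cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a statically allocated per-CPU device. */
struct fake_device {
	int id;
	int registered;
	const void *private_state;	/* set at registration time */
};

/* Static storage: the same memory is reused on every online. */
static struct fake_device fake_mce_dev;
static int core_state;

/* Registration refuses an object that still carries old state. */
static int fake_register(struct fake_device *dev)
{
	if (dev->private_state != NULL)
		return -1;	/* stale pointer from a previous cycle */
	dev->private_state = &core_state;
	dev->registered = 1;
	return 0;
}

/* Unregistration leaves private_state behind, modelling the stale
 * pointers that survive in a reused static structure. */
static void fake_unregister(struct fake_device *dev)
{
	dev->registered = 0;
}

/* CPU online path; clear_first selects whether the object is zeroed
 * before it is registered again. */
static int cpu_online(int cpu, int clear_first)
{
	if (clear_first)
		memset(&fake_mce_dev, 0, sizeof(fake_mce_dev));
	fake_mce_dev.id = cpu;
	return fake_register(&fake_mce_dev);
}
```

In this sketch, calling cpu_online(1, 0), then
fake_unregister(&fake_mce_dev), then cpu_online(1, 0) again fails on
the second online, whereas passing clear_first = 1 on the second
online succeeds because the leftover state has been wiped.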

What remained after the memset was only a harmless warning about
machinecheck not having a release() function. It merely reflected the
semantics that the driver core imposes, and was not really a bug as
such. (As I mentioned in one of my earlier posts, this warning existed
in much older kernels too, but it was hidden because pr_debug() was
used to print it. After the changeover from sysdev to struct device,
the call paths changed and we started hitting a WARN() instead of a
mild pr_debug(). The message conveyed by both was exactly the same.)

So the discussion in this thread was about how best to get rid of
that warning by playing by the rules of the driver core, instead of
circumventing them with a dummy release function added just to
silence the warning.
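
For readers unfamiliar with those rules, the following userspace
sketch models the general idea: a dynamically allocated object whose
memory is freed by its release callback once the last reference is
dropped. The names (fake_device, fake_get, fake_put, fake_release,
demo_two_users) are hypothetical; in the kernel the corresponding
roles are played by struct device, get_device(), put_device() and the
device's release() callback. This is an illustration of the pattern
under those assumptions, not the patch under discussion.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object with a release callback. */
struct fake_device {
	int refcount;
	void (*release)(struct fake_device *dev);
};

static int released_count;	/* how many objects have been freed */

/* A real release function: it frees the memory it is handed.
 * A dummy (empty) one would only hide the missing-release warning. */
static void fake_release(struct fake_device *dev)
{
	released_count++;
	free(dev);
}

/* Allocate dynamically; the creator holds the first reference. */
static struct fake_device *fake_device_alloc(void)
{
	struct fake_device *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	dev->refcount = 1;
	dev->release = fake_release;
	return dev;
}

static void fake_get(struct fake_device *dev)
{
	dev->refcount++;
}

/* Dropping the last reference triggers release(). */
static void fake_put(struct fake_device *dev)
{
	if (--dev->refcount == 0)
		dev->release(dev);
}

/* Two users: the object outlives the first put and is released
 * exactly once, on the last one. Returns the number of releases. */
static int demo_two_users(void)
{
	struct fake_device *dev = fake_device_alloc();
	int before;

	if (!dev)
		return -1;
	fake_get(dev);		/* second user takes a reference */
	fake_put(dev);		/* removal path drops its reference */
	before = released_count;
	assert(dev->refcount == 1);	/* memory is still pinned */
	fake_put(dev);		/* last user lets go: release runs */
	return released_count - before;
}
```

Because the object is allocated afresh each time, every new
registration starts from clean memory, and the release callback has
something real to do.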

Regards,
Srivatsa S. Bhat
