Re: [PATCH] x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocks

From: Yazen Ghannam
Date: Wed Oct 26 2022 - 11:39:35 EST


On Wed, Oct 26, 2022 at 02:04:04PM +0200, Greg KH wrote:
...
> > kobject_add() does a kobject_get() on the parent for each sysfs file it
> > adds.
> >
> > Therefore, in order to unwind the same setup work when the CPU goes
> > offline and the bank *references* only are being removed - the other
> > CPUs still share it - do a kobject_put() on the parent.
>
> Eeek, no!
>
> You can't decrement the reference on the parent, that could cause you to
> get dropped.
>
> And you can not reuse kobjects, is that the issue here? When you are
> done with one, you have to delete it. Then create a new one.
>
> No need to move anything around, just destory it all and then add new
> ones.
>

Yes, it seems like we're messing up refcounts as we try to manually manage the
life of shared kobjects.

But I take it this is not allowed, correct? So maybe the best solution is to
not do this sharing at all.

...
>
> What changed to cause problems? the kobject reference logic hasn't
> changed, was it some topology stuff?
>

Here's a snip from the commit message at the top of the thread:

"During kobject_del(), kobject->sd is removed. If the kobject is not part of
a kset with default_groups, then subsequent kobject_del() calls seem safe
even with kobject->sd == NULL.

Originally, the AMD MCA thresholding structures did not use default_groups.
And so the above behavior was not apparent.

However, a recent change implemented default_groups for the thresholding
structures. Therefore, kobject_del() will go down the sysfs_remove_groups()
code path. In this case, the first kobject_del() may succeed and remove
kobject->sd. But subsequent kobject_del() calls will give a WARNing in
kernfs_remove_by_name_ns() since kobject->sd == NULL."

Basically, we're deleting a shared kobject. The first CPU gets to delete it,
and the following CPUs complain that we're deleting an already deleted
kobject.

Sorry Boris, it's been a while since I looked at this. What's the issue with
my original patch? I think this is the simplest way to fix the current
implementation. But we should probably get rid of this kobject sharing idea in
light of Greg's comments.

Thanks,
Yazen