Re: [PATCH v6 33/42] x86/resctrl: resctrl_exit() teardown resctrl but leave the mount point

From: James Morse
Date: Thu Mar 06 2025 - 14:31:56 EST


Hi Reinette,

On 01/03/2025 02:35, Reinette Chatre wrote:
> On 2/28/25 11:54 AM, James Morse wrote:
>> On 20/02/2025 04:42, Reinette Chatre wrote:
>>> On 2/7/25 10:18 AM, James Morse wrote:
>>>> resctrl_exit() was intended for use when the 'resctrl' module was unloaded.
>>>> resctrl can't be built as a module, and the kernfs helpers are not exported
>>>> so this is unlikely to change. MPAM has an error interrupt which indicates
>>>> the MPAM driver has gone haywire. Should this occur tasks could run with
>>>> the wrong control values, leading to bad performance for important tasks.
>>>> The MPAM driver needs a way to tell resctrl that no further configuration
>>>> should be attempted.
>>>>
>>>> Using resctrl_exit() for this leaves the system in a funny state as
>>>> resctrl is still mounted, but cannot be un-mounted because the sysfs
>>>> directory that is typically used has been removed. Dave Martin suggests
>>>> this may cause systemd trouble in the future as not all filesystems
>>>> can be unmounted.
>>>>
>>>> Add calls to remove all the files and directories in resctrl, and
>>>> remove the sysfs_remove_mount_point() call that leaves the system
>>>> in a funny state. When triggered, this causes all the resctrl files
>>>> to disappear. resctrl can be unmounted, but not mounted again.
>>
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> index 6e30283358d4..424622d2f959 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>>> @@ -4371,9 +4375,12 @@ int __init resctrl_init(void)
>>>>
>>>
>>> Could you please add the kerneldoc you proposed in
>>> https://lore.kernel.org/lkml/f2ecb501-bc65-49a9-903d-80ba1737845f@xxxxxxx/ ?
>>
>> Huh. The way that is indented means I copied it out the file - I'm not sure went wrong
>> there. Thanks for fishing out the link!
>>
>>
>>>> void __exit resctrl_exit(void)
>>>> {
>>>> + mutex_lock(&rdtgroup_mutex);
>>>> + rdtgroup_destroy_root();
>>>> + mutex_unlock(&rdtgroup_mutex);
>>>> +
>>>> debugfs_remove_recursive(debugfs_resctrl);
>>>> unregister_filesystem(&rdt_fs_type);
>>>> - sysfs_remove_mount_point(fs_kobj, "resctrl");
>>>>
>>>> resctrl_mon_resource_exit();
>>>> }
>>>
>>> It is difficult for me to follow the kernfs reference counting required
>>> to make this work. Specifically, the root kn is "destroyed" here but it
>>> is required to stick around until unmount when the rest of the files
>>> are removed.
>>
>> This drops resctrl's reference to all of the files, which would make the files disappear.
>> unmount is what calls kernfs_kill_sb(), which gets rid of the root of the filesystem.
>
> My concern is mostly with the kernfs_remove() calls in the rdt_kill_sb()->rmdir_all_sub()
> flow. For example:
> kernfs_remove(kn_info);
> kernfs_remove(kn_mongrp);
> kernfs_remove(kn_mondata);
>
> As I understand the above require the destroyed root to still be around.

Right - because rdt_get_tree() has these global pointers into the hierarchy, but doesn't
take a reference. rmdir_all_sub() relies on always being called before
rdtgroup_destroy_root().

The point hack would be for rdtgroup_destroy_root() to NULL out those global pointers, (I
note they are left dangling) - that would make a subsequent call to rmdir_all_sub() harmless.

A better fix would be to pull out all the filesystem relevant parts from rdt_kill_sb(),
make that safe for multiple calls and get resctrl_exit() to call that.
A call to rdt_kill_sb() after resctrl_exit() would just cleanup the super-block.
This will leave things in a more predictable state.


>>> Have you been able to test this flow? I think you mentioned
>>> something like this before but I cannot find the details now.
>>
>> Yes:
>> https://web.git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/commit/?h=mpam/snapshot%2bextras/v6.14-rc1&id=8c96f858b25aa42694c5db56a2afe255ed8262dd
>>
>> This is a debugfs file that schedules the threaded bit of the MPAM error interrupt
>> handler. I figure its MPAM specific because there is no way into this code on x86.
>> (the aim is to get the CI to tickle this)



Thanks,

James