Re: [PATCH v7 33/49] x86/resctrl: resctrl_exit() teardown resctrl but leave the mount point

From: James Morse
Date: Mon Mar 10 2025 - 14:17:07 EST


Hi Reinette,

On 07/03/2025 19:04, Reinette Chatre wrote:
> On 3/7/25 9:54 AM, James Morse wrote:
>> On 07/03/2025 04:45, Reinette Chatre wrote:
>>> On 2/28/25 11:58 AM, James Morse wrote:
>>>> resctrl_exit() was intended for use when the 'resctrl' module was unloaded.
>>>> resctrl can't be built as a module, and the kernfs helpers are not exported
>>>> so this is unlikely to change. MPAM has an error interrupt which indicates
>>>> the MPAM driver has gone haywire. Should this occur tasks could run with
>>>> the wrong control values, leading to bad performance for important tasks.
>>>> The MPAM driver needs a way to tell resctrl that no further configuration
>>>> should be attempted.
>>>>
>>>> Using resctrl_exit() for this leaves the system in a funny state as
>>>> resctrl is still mounted, but cannot be un-mounted because the sysfs
>>>> directory that is typically used has been removed. Dave Martin suggests
>>>> this may cause systemd trouble in the future as not all filesystems
>>>> can be unmounted.
>>>>
>>>> Add calls to remove all the files and directories in resctrl, and
>>>> remove the sysfs_remove_mount_point() call that leaves the system
>>>> in a funny state. When triggered, this causes all the resctrl files
>>>> to disappear. resctrl can be unmounted, but not mounted again.
>>
>>> (copying v6 discussion here)
>>
>>> On 3/6/25 11:28 AM, James Morse wrote:
>>>> On 01/03/2025 02:35, Reinette Chatre wrote:
>>>>> On 2/28/25 11:54 AM, James Morse wrote:
>>>>>> On 20/02/2025 04:42, Reinette Chatre wrote:
>>>
>>>>>>> It is difficult for me to follow the kernfs reference counting required
>>>>>>> to make this work. Specifically, the root kn is "destroyed" here but it
>>>>>>> is required to stick around until unmount when the rest of the files
>>>>>>> are removed.
>>>>>>
>>>>>> This drops resctrl's reference to all of the files, which would make the files disappear.
>>>>>> unmount is what calls kernfs_kill_sb(), which gets rid of the root of the filesystem.
>>>>>
>>>>> My concern is mostly with the kernfs_remove() calls in the rdt_kill_sb()->rmdir_all_sub()
>>>>> flow. For example:
>>>>> kernfs_remove(kn_info);
>>>>> kernfs_remove(kn_mongrp);
>>>>> kernfs_remove(kn_mondata);
>>>>>
>>>>> As I understand the above require the destroyed root to still be around.
>>
>>>> Right - because rdt_get_tree() has these global pointers into the hierarchy, but doesn't
>>>> take a reference. rmdir_all_sub() relies on always being called before
>>>> rdtgroup_destroy_root().
>>
>>> Is this a known issue then?
>>
>> Its just a subtle thing that resctrl was relying on, and I didn't spot. Thanks for
>> pointing it out!
>>
>>
>>> Since I am not able to use your test I created something new
>>> after thinking there would be no response to my comment and indeed on unmount:
>>> [ 293.707228] BUG: KASAN: slab-use-after-free in kernfs_remove+0x87/0xa0
>>> [ 293.714718] Read of size 8 at addr ff11000309d88f30 by task umount/3793
>>>> The point hack would be for rdtgroup_destroy_root() to NULL out those global pointers, (I
>>>> note they are left dangling) - that would make a subsequent call to rmdir_all_sub() harmless.
>>>>
>>>> A better fix would be to pull out all the filesystem relevant parts from rdt_kill_sb(),
>>>> make that safe for multiple calls and get resctrl_exit() to call that.
>>>> A call to rdt_kill_sb() after resctrl_exit() would just cleanup the super-block.
>>>> This will leave things in a more predictable state.
>>
>>
>>> Why just the filesystem relevant parts?
>>
>> The stated aim is to prevent any further configuration of the hardware.
>> We can change that to 'unmount as much as possible'. I didn't think of it like that as I
>> couldn't find an in-kernel way of triggering a umount(). (and the mount may be copied into
>> numerous namespaces)
>
> deactivate_locked_super() looked promising when I started digging but I
> quickly found myself in unfamiliar territory.

It's not just the super-block, struct vfsmount maps parts of the directory hierarchy to
mounts of filesystems, so a umount would have to take a path - but mount namespaces means
its a set of paths ... I quickly gave up on this approach!

On-disk filesystems respond with some IO-error for any access if the backend storage goes
away. This is what the MPAM driver does - it would be noisy to get resctrl to do the same,
but the existing resctrl_exit() does pretty much everything needed...


>>> Although, you also state "resctrl_exit() would just
>>> cleanup the super-block" that sounds like you are thinking about pulling out all reset work.
>>> This sounds reasonable to me. It really feels more appropriate to do proper cleanup and
>>> not just wipe the root while leaving everything else underneath it.
>>
>> After this discussion, my new proposal is to do this:
>> -------%<-------
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 0d74a6d98dba..cee4604a59c0 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -3063,6 +3063,21 @@ static void rmdir_all_sub(void)
>> kernfs_remove(kn_mondata);
>> }
>>
>> +static void resctrl_fs_teardown(void)
>> +{
>> + lockdep_assert_held(&rdtgroup_mutex);
>> +
>> + /* Cleared by rdtgroup_destroy_root() */
>> + if (!rdtgroup_default.kn)
>> + return;
>> +
>> + rmdir_all_sub();
>> + rdt_pseudo_lock_release();
>> + rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>> + schemata_list_destroy();
>> + closid_exit();
>> + rdtgroup_destroy_root();
>> +}
>> +
>> static void rdt_kill_sb(struct super_block *sb)
>> {
>> struct rdt_resource *r;
>> @@ -3076,11 +3091,8 @@ static void rdt_kill_sb(struct super_block *sb)
>> for_each_alloc_capable_rdt_resource(r)
>> resctrl_arch_reset_all_ctrls(r);
>>
>> - rmdir_all_sub();
>> - rdt_pseudo_lock_release();
>> - rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>> - schemata_list_destroy();
>> - rdtgroup_destroy_root();
>> + resctrl_fs_teardown();
>> +
>> if (resctrl_arch_alloc_capable())
>> resctrl_arch_disable_alloc();
>> if (resctrl_arch_mon_capable())
>> resctrl_arch_disable_mon();
>> - closid_exit();
>> resctrl_mounted = false;
>> kernfs_kill_sb(sb);
>> mutex_unlock(&rdtgroup_mutex);
>> -------%<-------
>>
>> and call resctrl_fs_teardown() from resctrl_exit().
>>
>> This leaves a bunch of resctrl_arch_ calls behind, the reason is its the arch code that
>> calls resctrl_exit(), so it probably doesn't need to be told to disable things. For MPAM,
>> the work of resctrl_arch_reset_all_ctrls() will already have been done - and all these
>> calls will fail because the MPAM driver is blocking further access to the hardware.

> As I understand it the overflow and limbo handlers will keep running after a resctrl fs teardown
> done on error interrupt. As you mention in changelog this work is done because "The MPAM
> driver needs a way to tell resctrl that no further configuration should be attempted.".
> I believe these handlers may thus still try to interact (but not technically "configure"?)
> with the hardware at this point ... is this ok?

For MPAM this is fine because all the resctrl_arch_ calls are going to start returning
errors. I have a followup patch that makes the pr_warn_ratelimited() for the monitor
context a pr_warn_once() as once this thing goes off, the log starts to get filled up,
albeit with ratelimited messages. (that patch isn't part of this series because its not
possible to hit on x86).

I don't think its enough to just leave resctrl in place and report an error on any
configuration change - user-space may have created control groups at boot, then just move
tasks between them. Nothing about moving a task between existing control groups interacts
with the arch code, so it isn't possible for user-space to know that its requests are
being ignored. Hence the attempt to teardown the filesystem as far as possible.

(I'll add some of this thinking to the commit message for posterity)


> rdt_disable_ctx() is an odd one to keep with the resctrl_arch_ calls that remain in rdt_kill_sb().
> It may be a candidate for resctrl_fs_teardown() but unfortunately rdt_disable_ctx()
> blurs fs/arch parts. What I am focusing on is the set_mba_sc(false) within it. Seems like this
> may mean that the software controller will keep running unnecessarily.

Yes, the limbo and mbm_overflow 'threads' keep ticking after this call, but all their
accesses return an error.

The other trick the MPAM driver has here is to tell resctrl that all the CPUs (and
therefore domains) are offline. This is a prerequisite for free-ing any of the arch
allocated structures.

(if resctrl ever became a module, the lifetime of the cpuhp calls is something that would
have to belong to resctrl - but for now its up to the arch code to call)


Thanks,

James