Re: [PATCH v6 37/42] x86/resctrl: Expand the width of dom_id by replacing mon_data_bits

From: James Morse
Date: Fri Feb 28 2025 - 14:53:38 EST


Hi Reinette,

On 20/02/2025 05:40, Reinette Chatre wrote:
> On 2/7/25 10:18 AM, James Morse wrote:
>> MPAM platforms retrieve the cache-id property from the ACPI PPTT table.
>> The cache-id field is 32 bits wide. Under resctrl, the cache-id becomes
>> the domain-id, and is packed into the mon_data_bits union bitfield.
>> The width of cache-id in this field is 14 bits.
>>
>> Expanding the union would break 32bit x86 platforms as this union is
>> stored as the kernfs kn->priv pointer. This saved allocating memory
>> for the priv data storage.
>>
>> The firmware on MPAM platforms have used the PPTT cache-id field to
>> expose the interconnect's id for the cache, which is sparse and uses
>> more than 14 bits. Use of this id is to enable PCIe direct cache
>> injection hints. Using this feature with VFIO means the value provided
>> by the ACPI table should be exposed to user-space.
>>
>> To support cache-id values greater than 14 bits, convert the
>> mon_data_bits union to a structure. This is allocated when the kernfs
>> file is created, and free'd when the monitor directory is rmdir'd.

>> Readers and writers must hold the rdtgroup_mutex, and readers should
>> check for a NULL pointer to protect against an open file preventing
>> the kernfs file from being free'd immediately after the rmdir call.

> The last sentence is difficult to parse and took me many reads. I see
> two major parts to this statement and if I understand correctly the current
> implementation combined with this patch does not support either.
> (a) "checking for a NULL pointer from readers"
> The reader is rdtgroup_mondata_show() and it starts by calling:
> rdtgrp = rdtgroup_kn_lock_live(of->kn);
> As I understand, on return of rdtgroup_kn_lock_live() the kernfs node
> "of->kn" may no longer exist. This seems to be an issue with current code
> also.
> Considering this, it seems to me that checking if of->kn->priv is NULL
> may be futile if of->kn may no longer exist.

Certainly true.
Because the lifetime is different to the existing pointer-abuse version, I just added the
checks to be on the safe side.

I'll rip this out.
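
For anyone following along, here is a rough user-space sketch of what I mean by the pointer-abuse version versus the allocated structure. Only the 14-bit domain id width comes from the commit message. The type names, the other field names and their widths, and the helper functions are made up for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: everything except the 14-bit domid is an assumption. */
union mon_data_bits_sketch {
	void *priv;			/* what would be stored in kn->priv */
	struct {
		unsigned int rid   : 10;
		unsigned int evtid : 7;
		unsigned int sum   : 1;
		unsigned int domid : 14;
	} u;
};

/* Pack the fields into a pointer-sized value: no allocation needed. */
static void *pack_priv(unsigned int rid, unsigned int evtid, uint32_t cache_id)
{
	union mon_data_bits_sketch bits;

	assert(rid < (1u << 10) && evtid < (1u << 7));
	memset(&bits, 0, sizeof(bits));
	bits.u.rid = rid;
	bits.u.evtid = evtid;
	/* The bitfield keeps only the low 14 bits; the mask makes that explicit. */
	bits.u.domid = cache_id & 0x3fffu;
	return bits.priv;
}

static uint32_t unpack_domid(void *priv)
{
	union mon_data_bits_sketch bits;

	bits.priv = priv;
	return bits.u.domid;
}

/* The replacement: a separately allocated structure with a full-width id. */
struct mon_data_sketch {
	unsigned int rid;
	unsigned int evtid;
	unsigned int sum;
	uint32_t domid;
};

static struct mon_data_sketch *alloc_mon_data(unsigned int rid,
					      unsigned int evtid,
					      uint32_t cache_id)
{
	struct mon_data_sketch *md = calloc(1, sizeof(*md));

	if (!md)
		return NULL;
	md->rid = rid;
	md->evtid = evtid;
	md->domid = cache_id;	/* all 32 bits survive */
	return md;
}
```

The packed version never needs freeing, which is why its lifetime was never a question. The allocated version keeps the whole id, but someone now has to own the memory and free it.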


> I think this also needs a reference to the data needed by the file or
> the data needs to be stashed away before the call to
> kernfs_break_active_protection().

I've tried to hit this problem, and been unable. I'm happy to write it off as theoretical.

In particular:
* rmdir a control group while holding the mbm_local_bytes file open for reading. Any read
after the parent control group has been destroyed gets -ENODEV, even though
/proc/<pid>/fd shows the fd as open for reading. (The kernel in question had lockdep and
kasan enabled.)
* take all the CPUs in a domain offline while holding the mbm_local_bytes file open for
reading. Again, read attempts get -ENODEV.
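
The read side of both experiments boils down to something like the helper below. This is a minimal sketch, not the exact program I ran, and the helper name is made up. It reports the read result as a negative errno so the -ENODEV case is easy to spot.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read from an fd that was opened before the
 * rmdir or CPU-offline happened. Returns the number of bytes read,
 * or -errno on failure.
 */
static long read_stale_fd(int fd, char *buf, size_t len)
{
	ssize_t n = read(fd, buf, len);

	if (n < 0)
		return -(long)errno;
	return (long)n;
}
```

The intended use is: open() the mbm_local_bytes file, do the rmdir or take the CPUs offline from another shell, then call the helper on the fd that is still open. In the two cases above it came back with -ENODEV.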


> (b) "...being free'd immediately after the rmdir call"
> I believe this refers to expectation that one task may have the file open
> while another removes the resource group directory ("rmdir") with the
> assumption that the associated struct mon_data is removed during handling
> of rmdir.

This is what I was worried about - and it seemed worth chucking in a NULL check just in
case. Having tried a bit harder to hit it, it now seems theoretical.


> In this implementation the monitoring data file's struct mon_data
> is only removed when a monitoring domain goes offline.

> That is, when the
> resource group remains intact while the monitoring data files associated with
> one domain is removed (for example when all CPUs associated with that domain
> goes offline). The "rmdir" to remove a resource group does not call this code
> (mon_rmdir_one_subdir()), nor does the cleanup of the default resource group's
> "kn_mondata".

Huh, it's the path via user-space calling rmdir() that I was worried about. I hadn't
spotted that there are two of these and they aren't joined up!

This would leak the priv pointer when the user-space path via rmdir() just leaves the
cleanup to kernfs.

Fixing this produces even more spaghetti as domain-offline manipulates one domain in all
rdtgroups, whereas rmdir manipulates all domains in one rdtgroup. It's going to be noisy to
merge these two paths.


A simpler approach is to use the event kn->priv pointers in the default control group as
the canonical copy, which also saves memory. For mbm_total in a domain, every control and
monitor group has the same values in struct mon_data_bits - the RMID is found by walking
up the tree to find the struct rdtgroup.
As user-space can't rmdir the default control group, we only need to free it for
domain-offline. At that point we know all the files for that domain are going to be
removed, so the free doesn't have to happen in any particular order.
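
To make the ownership idea concrete, here is a minimal user-space sketch, assuming one canonical structure per (domain, event) pair that is owned by the default control group. All names, the fixed-size table and the slot index are hypothetical, and the real code would need the rdtgroup_mutex around all of this.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_MAX_DOMAINS 4
#define SKETCH_NUM_EVENTS  2

struct mon_data_sketch {
	unsigned int domid;
	unsigned int evtid;
};

/* Hypothetical: the default control group owns the canonical copies. */
struct default_group_sketch {
	struct mon_data_sketch *priv[SKETCH_MAX_DOMAINS][SKETCH_NUM_EVENTS];
};

/* Domain online: allocate each event's data once, for all groups. */
static int sketch_domain_online(struct default_group_sketch *def,
				unsigned int slot, unsigned int domid)
{
	unsigned int evt;

	assert(slot < SKETCH_MAX_DOMAINS);
	for (evt = 0; evt < SKETCH_NUM_EVENTS; evt++) {
		struct mon_data_sketch *md = malloc(sizeof(*md));

		if (!md) {
			/* Undo the partial allocation. */
			while (evt--) {
				free(def->priv[slot][evt]);
				def->priv[slot][evt] = NULL;
			}
			return -1;
		}
		md->domid = domid;
		md->evtid = evt;
		def->priv[slot][evt] = md;
	}
	return 0;
}

/* File creation in any group borrows the canonical pointer. */
static struct mon_data_sketch *
sketch_file_priv(const struct default_group_sketch *def,
		 unsigned int slot, unsigned int evt)
{
	assert(slot < SKETCH_MAX_DOMAINS && evt < SKETCH_NUM_EVENTS);
	return def->priv[slot][evt];
}

/* Domain offline is the only place anything is freed. */
static void sketch_domain_offline(struct default_group_sketch *def,
				  unsigned int slot)
{
	unsigned int evt;

	assert(slot < SKETCH_MAX_DOMAINS);
	for (evt = 0; evt < SKETCH_NUM_EVENTS; evt++) {
		free(def->priv[slot][evt]);
		def->priv[slot][evt] = NULL;
	}
}
```

With this shape the rmdir path has nothing to free, because a group's files only borrow the pointer, so the two teardown paths no longer need merging.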


> I am trying to get a handle on the different lifetimes and if I understand
> correctly this implementation does not attempt to keep the struct mon_data
> accessible as long as the file is open.

No, but I think that concern is theoretical...

> I do not think I've discovered all the implications of this yet.


Thanks,

James