Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

From: Huang, Kai
Date: Tue Apr 16 2024 - 09:22:36 EST


On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <kristen@xxxxxxxxxxxxxxx>
>
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The
> existing cgroup memory controller cannot be used to limit or account for
> SGX EPC memory, which is a desirable feature in some environments. For
> instance, within a Kubernetes environment, while a user may specify a
> particular EPC quota for a pod, the orchestrator requires a mechanism to
> enforce that the pod's actual runtime EPC usage does not exceed the
> allocated quota.
>
> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
> limit and track EPC allocations per cgroup. Earlier patches have added
> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
> support in SGX driver as the "sgx_epc" resource provider:
>
> - Set "capacity" of EPC by calling misc_cg_set_capacity()
> - Update EPC usage counter, "current", by calling charge and uncharge
> APIs for EPC allocation and deallocation, respectively.
> - Setup sgx_epc resource type specific callbacks, which perform
> initialization and cleanup during cgroup allocation and deallocation,
> respectively.
>
> With these changes, the misc cgroup controller enables users to set a hard
> limit for EPC usage in the "misc.max" interface file. It reports current
> usage in "misc.current", the total EPC memory available in
> "misc.capacity", and the number of times EPC usage reached the max limit
> in "misc.events".
>
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
>
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
>
> Co-developed-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> Signed-off-by: Kristen Carlson Accardi <kristen@xxxxxxxxxxxxxxx>
> Co-developed-by: Haitao Huang <haitao.huang@xxxxxxxxxxxxxxx>
> Signed-off-by: Haitao Huang <haitao.huang@xxxxxxxxxxxxxxx>
> Reviewed-by: Jarkko Sakkinen <jarkko@xxxxxxxxxx>
> Reviewed-by: Tejun Heo <tj@xxxxxxxxxx>
> Tested-by: Jarkko Sakkinen <jarkko@xxxxxxxxxx>

I don't see any big issue, so feel free to add:

Reviewed-by: Kai Huang <kai.huang@xxxxxxxxx>

Nitpickings below:

[...]


> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2022-2024 Intel Corporation. */
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>

It doesn't seem you need the above two here.

Probably they are needed in later patches, in that case we can move to the
relevant patch(es) that they got used.

However I think it's better to explicitly include <linux/slab.h> since
kzalloc()/kfree() are used.

Btw, I am not sure whether you want to use <linux/kernel.h> because looks
it contains a lot of unrelated staff. Anyway I guess nobody cares.

> +#include "epc_cgroup.h"
> +
> +/* The root SGX EPC cgroup */
> +static struct sgx_cgroup sgx_cg_root;

The comment isn't necessary (sorry didn't notice before), because the code
is pretty clear saying that IMHO.

[...]

>
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,72 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SGX_EPC_CGROUP_H_
> +#define _SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>

I don't see why you need <asm/sgx.h> here. Also, ...

> +#include <linux/cgroup.h>
> +#include <linux/misc_cgroup.h>
> +
> +#include "sgx.h"

... "sgx.h" already includes <asm/sgx.h>

[...]

>
> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
> +{
> + /* get_current_misc_cg() never returns NULL when Kconfig enabled */
> + return sgx_cgroup_from_misc_cg(get_current_misc_cg());
> +}

I spent some time looking into this. And yes if I was reading code
correctly the get_current_misc_cg() should never return NULL when Kconfig
is on.

I typed my analysis below in [*]. And it would be helpful if any cgroup
expert can have a second eye on this.

[...]


> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
> #include <linux/highmem.h>
> #include <linux/kthread.h>
> #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>

Is this needed? I believe SGX variants in "epc_cgroup.h" should be enough
for sgx/main.c?

[...]


[*] IIUC get_current_misc_cg() should never return NULL when Kconfig is on
(code indent slight adjusted for text wrap).

Firstly, during kernel boot there's always a valid @css allocated for MISC
cgroup, regardless whether it is disabled in kernel command line.

int __init cgroup_init(void)
{
...

for_each_subsys(ss, ssid) {
if (ss->early_init) {
...
} else {
cgroup_init_subsys(ss, false);
}

...

if (!cgroup_ssid_enabled(ssid))
continue;
...
}
...
}

cgroup_init_subsys() makes a valid @css is allocated for MISC cgroup and
set the pointer to the @init_css_set.

static void __init cgroup_init_subsys(struct cgroup_subsys *ss,
...)
{
struct cgroup_subsys_state *css;

...
css = ss->css_alloc(NULL);
/* We don't handle early failures gracefully */
BUG_ON(IS_ERR(css));

...

init_css_set.subsys[ss->id] = css;

...
}

All processes are by default associated to the @init_css_set:

void cgroup_fork(struct task_struct *child)
{
RCU_INIT_POINTER(child->cgroups, &init_css_set);
INIT_LIST_HEAD(&child->cg_list);
}

At runtime, when a new cgroup is created in the hierarchy, the "cgroup"
can have a NULL @css if some subsystem is not enabled in it:

static int cgroup_apply_control_enable(struct cgroup *cgrp)
{
struct cgroup *dsct;
struct cgroup_subsys_state *d_css;
struct cgroup_subsys *ss;
int ssid, ret;

cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
for_each_subsys(ss, ssid) {
struct cgroup_subsys_state *css =
cgroup_css(dsct, ss);

if (!(cgroup_ss_mask(dsct) &
(1 << ss->id)))
continue;

if (!css) {
css = css_create(dsct, ss);
if (IS_ERR(css))
return PTR_ERR(css);
}
...
}
}
}

We can see if cgroup_ss_mask(dsct) doesn't have subsystem enabled, the
css_create() won't be invoked, and cgroup->subsys[ssid] will remain NULL.

However, when a process is bound to a specific cgroup, the kernel tries to
get the cgorup's "effective css", and it seems this "effective css" cannot
be NULL if the subsys has a valid 'struct cgroup_subsys' provided, which
the MISC cgroup does.

There are couple of code paths can lead to this, but they all reach to

static struct css_set *find_existing_css_set(...)
{
struct cgroup_root *root = cgrp->root;
struct cgroup_subsys *ss;
struct css_set *cset;
unsigned long key;
int i;


for_each_subsys(ss, i) {
if (root->subsys_mask & (1UL << i)) {
/*
* @ss is in this hierarchy, so we want
* the effective css from @cgrp.
*/
template[i] = cgroup_e_css_by_mask(cgrp,
ss);
} else {
/*
* @ss is not in this hierarchy, so we
* don't want to change the css.
*/
template[i] = old_cset->subsys[i];
}
}
...
}

Which calls cgroup_e_css_by_mask() to get the "effective css" when subsys
is enabled in the root cgroup (which means MISC cgroup is not disabled by
kernel command line), or get the default css, which is @init_css_set-
>subsys[ssid], which is always valid for MISC cgroup.

And more specifically, the "effective css" in the cgroup_e_css_by_mask()
is done by searching the entire hierarchy, so MISC cgroup will always have
a valid "effective css".

static struct cgroup_subsys_state *cgroup_e_css_by_mask(
struct cgroup *cgrp,
struct cgroup_subsys *ss)
{
lockdep_assert_held(&cgroup_mutex);

if (!ss)
return &cgrp->self;

...
while (!(cgroup_ss_mask(cgrp) & (1 << ss->id))) {
cgrp = cgroup_parent(cgrp);
if (!cgrp)
return NULL;
}

return cgroup_css(cgrp, ss);
}

The comment of cgroup_e_css_by_mask() says:

* Similar to cgroup_css() but returns the effective css, which is defined
* as the matching css of the nearest ancestor including self which has @ss
* enabled. If @ss is associated with the hierarchy @cgrp is on, this
* function is guaranteed to return non-NULL css.

It's hard for me to interpret the second sentence, specifically, what does
"@ss is associated with the hierarchy @cgrp is on" mean. I interpret it
as "subsys is enabled in root and/or any descendants".

But again, in the find_existing_css_set() it is called when the root
cgroup has enabled the subsys, so it should always return a non-NULL css.

And that means for any process, get_current_misc_cg() cannot be NULL.