On Mon, Oct 09, 2023, Kai Huang wrote:Cgroup would refuse to allocate more when limit is reached so VMs can not run above limit.
On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> +/**
> + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> + * @lru: LRU that is low
> + *
> + * Return: %true if a victim was found and kicked.
> + */
> +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> +{
> + struct sgx_epc_page *victim;
> +
> + spin_lock(&lru->lock);
> + victim = sgx_oom_get_victim(lru);
> + spin_unlock(&lru->lock);
> +
> + if (!victim)
> + return false;
> +
> + if (victim->flags & SGX_EPC_OWNER_PAGE)
> + return sgx_oom_encl_page(victim->encl_page);
> +
> + if (victim->flags & SGX_EPC_OWNER_ENCL)
> + return sgx_oom_encl(victim->encl);
I hate to bring this up, at least at this stage, but I am wondering why we need
to put VA and SECS pages to the unreclaimable list, but cannot keep an
"enclave_list" instead?
The motivation for tracking EPC pages instead of enclaves was so that the EPC
OOM-killer could "kill" VMs as well as host-owned enclaves. The virtual EPC code
didn't actually kill the VM process, it instead just freed all of the EPC pages
and abused the SGX architecture to effectively make the guest recreate all its
enclaves (IIRC, QEMU does the same thing to "support" live migration).
Looks like y'all punted on that with:
The EPC pages allocated for KVM guests by the virtual EPC driver are not
reclaimable by the host kernel [5]. Therefore they are not tracked by any
LRU lists for reclaiming purposes in this implementation, but they are
charged toward the cgroup of the user processs (e.g., QEMU) launching the
guest. And when the cgroup EPC usage reaches its limit, the virtual EPC
driver will stop allocating more EPC for the VM, and return SIGBUS to the
user process which would abort the VM launch.
which IMO is a hack, unless returning SIGBUS is actually enforced somehow. Relying
on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
enforcement. E.g. if the hard limit for a EPC cgroup is lowered, userspace running
encalves in a VM could continue on and refuse to give up its EPC, and thus run above
its limit in perpetuity.
I can see userspace wanting to explicitly terminate the VM instead of "silently"
the VM's enclaves, but that seems like it should be a knob in the virtual EPC
code.