Re: [PATCH v3 03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()

From: Borislav Petkov
Date: Mon Mar 22 2021 - 17:07:25 EST


On Mon, Mar 22, 2021 at 12:37:02PM -0700, Sean Christopherson wrote:
> Yes. Note, it's still true if you strike out the "too", KVM support is completely
> orthogonal to this code. The purpose of this patch is to separate out the EREMOVE
> path used for host enclaves (/dev/sgx_enclave), because EPC virtualization for
> KVM will have non-buggy scenarios where EREMOVE can fail. But the virt EPC code
> is designed to handle that gracefully.

"Gracefully" as in it won't leak EPC pages, which would require a host
reboot? So the leaking is done by host enclaves only?

> Hmm. I don't think it warrants BUG. At worst, leaking EPC pages is fatal only
> to SGX.

Fatal how? If it keeps leaking, at some point it won't have any EPC
pages left?

Btw, I probably have seen this and forgotten again so pls remind me:
is the number of pages available for SGX use static and limited by,
I believe, the BIOS, or can leaking EPC pages cause a system memory
shortage?

> If the underlying bug caused other fallout, e.g. didn't release a
> lock, then obviously that could be fatal to the kernel. But I don't
> think there's ever a case where SGX being unusable would prevent the
> kernel from functioning.

This kinda answers my question above but still...

> Probably something in between. Odds are good SGX will eventually become
> unusable, e.g. either kernel SGX support is completely hosed, or it will soon
> leak the majority of EPC pages. Something like this?
>
> "EREMOVE returned %d (0x%x), kernel bug likely. EPC page leaked, SGX may become unusable. Reboot recommended to continue using SGX."

So all this handwaving I'm doing is to provoke a proper response from
you guys as to how an EPC page leak is supposed to be handled by the
users of the technology:

1. Issue a warning message and forget about it, eventual reboot

2. Really scary message to make users reboot sooner

3. Detect when host enclaves are run while guest enclaves are running
and issue a warning then.

4. Fall on knees and pray to not get sued by customers because their
enclaves are not working anymore.

....

Btw, 4. needs to be considered properly so that people can cover asses.

Oh and whatever we end up deciding, we should document that in
Documentation/... somewhere and point users to it from that warning
message, so that a longer treatise can explain the whole deal properly.
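
To make options 1 and 2 a bit more concrete, here is a minimal userspace
sketch of "warn and leak the page". It is not kernel code and not what the
patch does: struct mock_epc_pool and all mock_* helpers are hypothetical
names invented for illustration, and the pool is assumed to be a fixed set
of pages that is never replenished at runtime. The message text is the one
proposed above; the Documentation/... location is left open, as in the
discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace model of a fixed-size EPC pool (not kernel code). */
struct mock_epc_pool {
	unsigned long total;	/* pages in the pool, assumed fixed at boot */
	unsigned long free;	/* pages currently available for allocation */
	unsigned long leaked;	/* pages lost because EREMOVE failed */
};

/* Format the warning text proposed in the thread for a failed EREMOVE. */
static int mock_format_leak_warning(char *buf, size_t len, int ret)
{
	return snprintf(buf, len,
			"EREMOVE returned %d (0x%x), kernel bug likely. "
			"EPC page leaked, SGX may become unusable. "
			"Reboot recommended to continue using SGX. "
			"See Documentation/... for details.",
			ret, (unsigned int)ret);
}

/*
 * Model of freeing one page: on EREMOVE failure the page is deliberately
 * NOT returned to the free count, and a warning is printed instead.
 * Returns 0 if the page went back to the pool, -1 if it was leaked.
 */
static int mock_free_epc_page(struct mock_epc_pool *pool, int eremove_ret)
{
	char msg[320];

	assert(pool != NULL);

	if (eremove_ret) {
		pool->leaked++;
		mock_format_leak_warning(msg, sizeof(msg), eremove_ret);
		fprintf(stderr, "%s\n", msg);
		return -1;
	}

	pool->free++;
	return 0;
}

/* SGX stays usable in this model until every page has been leaked. */
static int mock_sgx_usable(const struct mock_epc_pool *pool)
{
	assert(pool != NULL);
	return pool->leaked < pool->total;
}
```

In this model a leak never touches anything outside the pool, which is the
"fatal only to SGX" argument: once leaked reaches total, mock_sgx_usable()
returns 0 and only a reboot (re-creating the pool) helps.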

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette