Re: [PATCH v3 03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()

From: Sean Christopherson
Date: Mon Mar 22 2021 - 15:38:08 EST


On Mon, Mar 22, 2021, Borislav Petkov wrote:
> On Mon, Mar 22, 2021 at 11:56:37AM -0700, Sean Christopherson wrote:
> > Not necessarily. This can only trigger in the host, and thus require a host
> > reboot, if the host is also running enclaves. If the CSP is not running
> > enclaves, or is running its enclaves in a separate VM, then this path cannot be
> > reached.
>
> That's what I meant. Rebooting guests is a lot easier, ofc.
>
> Or are you saying, this can trigger *only* when they're running enclaves
> on the *host* too?

Yes. Note, it's still true if you strike out the "too"; KVM support is completely
orthogonal to this code. The purpose of this patch is to separate out the EREMOVE
path used for host enclaves (/dev/sgx_enclave), because EPC virtualization for
KVM will have non-buggy scenarios where EREMOVE can fail. But the virt EPC code
is designed to handle that gracefully.

> > EREMOVE can only fail if there's a kernel or hardware bug (or a VMM bug if
> > running as a guest).
>
> We get those on a daily basis.
>
> > IME, nearly every kernel/KVM bug that I introduced that led to EREMOVE
> > failure was also quite fatal to SGX, i.e. this is just the canary in
> > the coal mine.
> >
> > It's certainly possible to add more sophisticated error handling, e.g. throw
> > the pages onto a list and periodically try to recover them. But, since the vast
> > majority of bugs that cause EREMOVE failure are fatal to SGX, implementing
> > sophisticated handling is quite low on the list of priorities.
> >
> > Dave wanted the "page leaked" error message so that it's abundantly clear that
> > the kernel is leaking pages on EREMOVE failure and that the WARN isn't "benign".
>
> So this sounds to me like this should BUG too eventually.
>
> Or is this one of those "this should never happen" things so no one
> should worry?

Hmm. I don't think it warrants BUG. At worst, leaking EPC pages is fatal only
to SGX. If the underlying bug caused other fallout, e.g. didn't release a lock,
then obviously that could be fatal to the kernel. But I don't think there's
ever a case where SGX being unusable would prevent the kernel from functioning.
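
To make the WARN-versus-BUG distinction concrete, here is a minimal userspace
sketch in C, not the actual patch. Everything in it (struct epc_page, struct
epc_pool, fake_eremove(), host_free_epc_page(), and the error value 1) is a
hypothetical stand-in for illustration. The only behavior taken from the
discussion is that a failed EREMOVE logs a message and leaks the page instead
of halting the system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; none of these are kernel APIs. */
struct epc_page {
	bool eremove_will_fail;	/* simulates a kernel or hardware bug */
	struct epc_page *next;
};

struct epc_pool {
	struct epc_page *free_list;
	size_t nr_free;
	size_t nr_leaked;
};

/* Simulated EREMOVE: 0 on success, an arbitrary nonzero code on failure. */
static int fake_eremove(const struct epc_page *page)
{
	return page->eremove_will_fail ? 1 : 0;
}

/*
 * Host-enclave free path: on EREMOVE failure, complain and leak the page
 * rather than crash. The page is never put back on the free list, so the
 * damage is confined to the (simulated) EPC pool.
 */
static int host_free_epc_page(struct epc_pool *pool, struct epc_page *page)
{
	int ret;

	assert(pool != NULL && page != NULL);

	ret = fake_eremove(page);
	if (ret) {
		fprintf(stderr, "EREMOVE returned %d (0x%x), EPC page leaked\n",
			ret, (unsigned int)ret);
		pool->nr_leaked++;
		return ret;
	}

	/* Success: the page is clean and can be handed out again. */
	page->next = pool->free_list;
	pool->free_list = page;
	pool->nr_free++;
	return 0;
}
```

In this sketch the caller keeps running after a failure; only the pool's
nr_leaked count grows, which mirrors the "fatal only to SGX" argument above.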

> Whatever it is, if an admin sees this message in dmesg and doesn't get a
> lengthy explanation what she/he is supposed to do, I don't think she/he
> will be as relaxed.
>
> Hell, people open bugs for correctable ECCs and are asking whether they
> need to replace their hardware.

LOL.

> So let's play this out: put yourself in an admin's shoes and tell me how
> should an admin react when she/he sees that?
>
> Should the kernel probably also say: "Don't worry, you have enough
> memory and what's a 4K, who cares? You'll reboot eventually."
>
> Or should the kernel say "You need to reboot ASAP."
>
> And so on...
>
> So what is the scenario here and what kind of reaction is that message
> supposed to cause, recovery action, blabla, the whole spiel?

Probably something in between. Odds are good SGX will eventually become
unusable, e.g. either kernel SGX support is completely hosed, or it will soon
leak the majority of EPC pages. Something like this?

"EREMOVE returned %d (0x%x), kernel bug likely. EPC page leaked, SGX may become unusable. Reboot recommended to continue using SGX."
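
As a rough illustration of how that format string renders, here is a small
userspace sketch. format_eremove_leak_msg() is a hypothetical helper, not
something in the kernel, and snprintf() stands in for whatever logging call
the real code would use. The same return value is printed twice, once as
decimal and once as hex, to match the two conversions in the string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a kernel API): renders the proposed message
 * into buf and returns snprintf()'s result, i.e. the length the full
 * message needs, excluding the terminating NUL.
 */
static int format_eremove_leak_msg(char *buf, size_t len, int ret)
{
	assert(buf != NULL && len > 0);

	return snprintf(buf, len,
			"EREMOVE returned %d (0x%x), kernel bug likely. "
			"EPC page leaked, SGX may become unusable. "
			"Reboot recommended to continue using SGX.",
			ret, (unsigned int)ret);
}
```

Calling it with a 256-byte buffer and ret = 5 yields a message that begins
"EREMOVE returned 5 (0x5), kernel bug likely."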