Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption

From: Dave Hansen
Date: Mon May 09 2022 - 18:56:28 EST


On 5/9/22 15:17, Borislav Petkov wrote:
>
>> This new ABI provides a way to avoid that situation in the first place.
>> Userspace can look at sysfs to figure out which NUMA nodes support
>> "encryption" (aka. TDX) and can use the existing NUMA policy ABI to
>> avoid TDH.MEM.PAGE.ADD failures.
>>
>> So, here's the question for the TDX folks: are these mixed-capability
>> systems a problem for you? Does this ABI help you fix the problem?
> What I'm also not really sure about is: is per-node granularity ok? I guess
> it is, but let me ask anyway...

I think nodes are the only sane granularity.

tl;dr: Zones might work in theory but have no existing useful ABI around
them and too many practical problems. Nodes are the only other real
option without inventing something new and fancy.

--

What about zones (or any sub-node granularity really)?

Folks have, for instance, discussed adding new memory zones for this
purpose: have ZONE_NORMAL, and then ZONE_UNENCRYPTABLE (or something
similar). Zones are great because they have their own memory allocation
pools and can be targeted directly from within the kernel using things
like GFP_DMA. If you run out of ZONE_FOO, you can theoretically just
reclaim ZONE_FOO.
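
For illustration, this is roughly what in-kernel zone targeting looks
like today. A hypothetical ZONE_*_UNENCRYPTABLE would presumably grow a
similar GFP modifier; the __GFP_UNENCRYPTABLE name in the comment below
is made up purely to show the shape of it:

#include <linux/gfp.h>
#include <linux/slab.h>

static void *alloc_from_specific_zone(void)
{
        /*
         * GFP_DMA steers this allocation into ZONE_DMA's pool.
         * A new zone would presumably get its own modifier, e.g.:
         *
         *      kmalloc(4096, GFP_KERNEL | __GFP_UNENCRYPTABLE);
         *
         * (__GFP_UNENCRYPTABLE does not exist; it is only here to show
         * what zone targeting from kernel code looks like.)
         */
        return kmalloc(4096, GFP_KERNEL | GFP_DMA);
}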

But, even a single new zone isn't necessarily good enough. What if we
have some ZONE_NORMAL that's encryption-capable and some that's not?
The same goes for ZONE_MOVABLE. We'd probably need at least:

ZONE_NORMAL
ZONE_NORMAL_UNENCRYPTABLE
ZONE_MOVABLE
ZONE_MOVABLE_UNENCRYPTABLE

Also, zones are (mostly) not exposed to userspace. If we want userspace
to be able to specify encryption capabilities, we're talking about new
ABI for enumeration and policy specification.

Why node granularity?

First, for the majority of cases, nodes "just work". ACPI systems with
an "HMAT" table already separate out different performance classes of
memory into different Proximity Domains (PXMs) which the kernel maps
into NUMA nodes.

This means that NVDIMMs and virtually any CXL memory region we can think
of (one or more CXL devices glued together) already get their own NUMA
nodes. Those nodes have their own zones (implicitly) and can lean on the
existing NUMA ABI for enumeration and policy creation.

Basically, the firmware creates the NUMA nodes for the kernel. All the
kernel has to do is report which of them can do encryption and which
cannot.
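
To make that concrete, here is a rough userspace sketch of the intended
flow with this series: read the new per-node sysfs attribute, then use
the existing NUMA policy ABI (libnuma here) so allocations only land on
encryption-capable nodes. The "crypto_capable" attribute name is just a
placeholder for whatever the final ABI ends up calling it:

/* sketch; build with: gcc -o bind_capable bind_capable.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
        struct bitmask *capable;
        int nid;

        if (numa_available() < 0)
                return 1;

        capable = numa_allocate_nodemask();

        for (nid = 0; nid <= numa_max_node(); nid++) {
                char path[80];
                int ok = 0;
                FILE *f;

                /* "crypto_capable" is a stand-in for the new attribute name */
                snprintf(path, sizeof(path),
                         "/sys/devices/system/node/node%d/crypto_capable", nid);
                f = fopen(path, "r");
                if (!f)
                        continue;
                if (fscanf(f, "%d", &ok) == 1 && ok)
                        numa_bitmask_setbit(capable, nid);
                fclose(f);
        }

        if (numa_bitmask_weight(capable) == 0)
                return 1;

        /*
         * Existing NUMA policy ABI: all future allocations come only from
         * the encryption-capable nodes, so TDH.MEM.PAGE.ADD never sees an
         * unencryptable page.
         */
        numa_set_membind(capable);

        /* ... go allocate and add TD guest memory ... */
        return 0;
}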

The one place where nodes fall down is if a memory hot-add occurs within
an existing node and the newly hot-added memory does not match the
encryption capabilities of the existing memory. The kernel basically
has two options in that case:
* Throw away the memory until the next reboot, when the system might be
  reconfigured in a way that supports more uniform capabilities (this is
  actually *likely* for a reboot of a TDX system); see the sketch after
  this list
* Create a synthetic NUMA node to hold it
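
A minimal sketch of how the "throw it away" option could be wired up,
assuming some check for whether a PFN range matches the node's
encryption capability (range_is_encryption_capable() below is made up):
a memory hotplug notifier simply refuses to online the mismatched
memory.

#include <linux/init.h>
#include <linux/memory.h>
#include <linux/notifier.h>
#include <linux/types.h>

/*
 * Made-up placeholder: a real check would consult whatever map of
 * encryption-capable memory the platform provides.
 */
static bool range_is_encryption_capable(unsigned long start_pfn,
                                        unsigned long nr_pages)
{
        return true;
}

static int encrypt_mismatch_notifier(struct notifier_block *nb,
                                     unsigned long action, void *v)
{
        struct memory_notify *mn = v;

        if (action != MEM_GOING_ONLINE)
                return NOTIFY_OK;

        /*
         * Refuse to online hot-added memory whose encryption capability
         * does not match the rest of the node ("throw it away").
         */
        if (!range_is_encryption_capable(mn->start_pfn, mn->nr_pages))
                return NOTIFY_BAD;

        return NOTIFY_OK;
}

static struct notifier_block encrypt_mismatch_nb = {
        .notifier_call = encrypt_mismatch_notifier,
};

static int __init encrypt_mismatch_init(void)
{
        return register_memory_notifier(&encrypt_mismatch_nb);
}
early_initcall(encrypt_mismatch_init);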

Neither one of those is a horrible option. Throwing the memory away is
the most likely way TDX will handle this situation if it pops up. For
now, the folks building TDX-capable BIOSes claim emphatically that such
a system won't be built.