Yang Shi wrote:
On Thu, Oct 3, 2024 at 4:32 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
Yang Shi wrote:
On Mon, Sep 30, 2024 at 4:54 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
Hi, David,
Thanks a lot for comments!
David Hildenbrand <david@xxxxxxxxxx> writes:
On 30.09.24 07:51, Huang Ying wrote:
On systems with TDX (Trust Domain eXtensions) enabled, memory ranges
hot-added must be checked for compatibility by TDX. This is currently
implemented through memory hotplug notifiers for each memory_block.
If a memory range which isn't TDX compatible is hot-added, for
example, some CXL memory, the command line as follows,
$ echo 1 > /sys/devices/system/node/nodeX/memoryY/online
will report something like,
bash: echo: write error: Operation not permitted
If pr_debug() is enabled, the error message like below will be shown
in the kernel log,
online_pages [mem 0xXXXXXXXXXX-0xXXXXXXXXXX] failed
Both are too general to root cause the problem. This will confuse
users. One solution is to print some error messages in the TDX memory
hotplug notifier. However, memory hotplug notifiers are called for
each memory block, so this may lead to a large volume of messages in
the kernel log if a large number of memory blocks are onlined with a
script or automatically. For example, the typical size of memory
block is 128MB on x86_64, when online 64GB CXL memory, 512 messages
will be logged.
ratelimiting would likely help here a lot, but I agree that it is
suboptimal.
Therefore, in this patch, the whole hot-adding memory range is
checked
for TDX compatibility through a newly added architecture specific
function (arch_check_hotplug_memory_range()). If rejected, the memory
hot-adding will be aborted with a proper kernel log message. Which
looks like something as below,
virt/tdx: Reject hot-adding memory range: 0xXXXXXXXX-0xXXXXXXXX
for TDX compatibility.
The target use case is to support CXL memory on TDX enabled systems.If the CXL memory isn't compatible with TDX, the whole CXL memory
range hot-adding will be rejected. While the CXL memory can still be
used via devdax interface.
I'm curious, why can that memory be used through devdax but not
through the buddy? I'm probably missing something important :)
Because only TDX compatible memory can be used for TDX guest. The buddy
is used to allocate memory for TDX guest. While devdax will not be used
for that.
Sorry for chiming in late. I think CXL also faces the similar problem
on the platform with MTE (memory tagging extension on ARM64). AFAIK,
we can't have MTE on CXL, so CXL has to stay as dax device if MTE is
enabled.
We should need a similar mechanism to prevent users from hot-adding
CXL memory if MTE is on. But not like TDX I don't think we have a
simple way to tell whether the pfn belongs to CXL or not. Please
correct me if I'm wrong. I'm wondering whether we can find a more
common way to tell memory hotplug to not hot-add some region. For
example, a special flag in struct resource. off the top of my head.
No solid idea yet, I'm definitely seeking some advice.
Could the ARM version of arch_check_hotplug_memory_range() check if MTE
is enabled in the CPU and then ask the CXL subsystem if the address range is
backed by a topology that supports MTE?
Kernel can tell whether MTE is really enabled. For the CXL part, IIUC
that relies on the CXL subsystem is able to tell whether that range
can support MTE or not, right? Or CXL subsystem tells us whether the
range is CXL memory range or not, then we can just refuse MTE for all
CXL regions for now. Does CXL support this now?
So the CXL specification has section:
8.2.4.31 CXL Extended Metadata Capability Register
...that indicates if the device supports "Extended Metadata" (EMD).
However, the CXL specification does not talk about how a given hosts
uses the extended metadata capabilities of a device. That detail would
need to come from an ARM platform specification.
Currently CXL subsystem does nothing with this since there has been no
need to date, but I would expect someone from the ARM side to plumb this
detection into the CXL subsystem.
However, why would it be ok to access CXL memory without MTE via devdax,
but not as online page allocator memory?
CXL memory can be onlined as system ram as long as MTE is not enabled.
It just can be used as devdax device if MTE is enabled.
Do you mean the kernel only manages MTE for kernel pages, but with user
mapped memory the application will need to implicitly know that
memory-tagging is not available?
I worry about applications that might not know that their heap is coming
from a userspace memory allocator backed by device-dax rather than the
kernel.
If the goal is to simply deny any and all non-MTE supported CXL region
from attaching then that could probably be handled as a modification to
the "cxl_acpi" driver to deny region creation unless it supports
everything the CPU expects from "memory".
I'm not quite familiar with the details in CXL driver. What did you
mean "deny region creation"? As long as the CXL memory still can be
used as devdax device, it should be fine.
Meaning that the CXL subsytem knows how to, for a given address range, figure
out the members and geometry of the CXL devices that contribute to that
range (CXL region). It would be straightforward to add EMD to that
enumeration and flag the CXL region as not online-capable if the CPU has
MTE enabled but no EMD capability.