Re: [RFC PATCH] /dev/mem: Disable /dev/mem under TDX guest

From: Nikolay Borisov
Date: Mon Mar 24 2025 - 05:59:35 EST




On 18.03.25 at 21:06, Dan Williams wrote:
Nikolay Borisov wrote:
If a piece of memory is read from /dev/mem that falls outside of the
System RAM region, i.e. the BIOS data region, the kernel creates a shared
mapping via xlate_dev_mem_ptr() (this behavior was introduced by
9aa6ea69852c ("x86/tdx: Make pages shared in ioremap()")). This results
in a region having both a shared and a private mapping.

Subsequent accesses to this region via the private mapping induce a
SEPT violation and a crash of the VMM. In this particular case the
scenario was a userspace process reading something from the BIOS data
area at address 0x497, which creates a shared mapping, and a follow-up
reboot accessing __va(0x472), which accesses pfn 0 via the private
mapping, causing mayhem.
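
To make the overlap concrete: with 4 KiB pages, both addresses mentioned
above land in page frame 0, so the shared mapping created for 0x497 and
the private access to 0x472 refer to the same page. A minimal sketch of
the arithmetic, assuming the usual x86 PAGE_SHIFT of 12; the helper names
addr_to_pfn() and same_pfn() are made up for illustration and are not
kernel functions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on x86. */
#define PAGE_SHIFT 12

/* Hypothetical helper: physical address -> page frame number. */
static inline uint64_t addr_to_pfn(uint64_t phys)
{
	return phys >> PAGE_SHIFT;
}

/* True if both physical addresses fall inside the same page frame. */
static inline bool same_pfn(uint64_t a, uint64_t b)
{
	return addr_to_pfn(a) == addr_to_pfn(b);
}
```

Calling same_pfn(0x497, 0x472) returns true, since both addresses shift
down to pfn 0.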

Fix this by simply forbidding access to /dev/mem when running as a TDX
guest.

Signed-off-by: Nikolay Borisov <nik.borisov@xxxxxxxx>
---

Sending this now to hopefully spur discussion on how to handle the described
scenario. This was hit on the GCP cloud and was causing their hypervisor to crash.

I guess the most pressing question is what the most sensible approach is to
prevent such situations from happening in the future:

1. Should we forbid getting a descriptor to /dev/mem (this patch)?
2. Skip creating /dev/mem altogether.
3. Possibly tinker with the internals of ioremap to ensure that no memory which
is backed by kvm memslots is remapped as shared.

It seems unfortunate that the kernel is allowing conflicting mappings of
the same pfn. Is this not just a track_pfn_remap() bug report? In other
words, whatever established the conflicting private mapping failed to do
a memtype_reserve() with the encryption setting such that
track_pfn_remap() could find it and enforce a consistent mapping.

I'm not an expert in this, but looking at the code it seems
memtype_reserve() deals with the memory type w.r.t. PAT/MTRR, i.e. the
cacheability of the memory, not with whether the mapping is consistent
w.r.t. other, arbitrary attributes.
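
For the sake of discussion, the kind of consistency check being asked for
would look roughly like the userspace model below: each reserved pfn range
carries a shared/private attribute, and a new reservation is refused if it
overlaps a range with a different one. This is only a sketch. enc_attr,
enc_tracker and enc_reserve() are hypothetical names, not existing kernel
interfaces, and it makes no claim about how memtype_reserve() is
implemented.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute; not an existing kernel type. */
enum enc_attr { ENC_PRIVATE, ENC_SHARED };

struct enc_range {
	uint64_t start_pfn;	/* inclusive */
	uint64_t end_pfn;	/* exclusive */
	enum enc_attr attr;
};

#define MAX_RANGES 16

struct enc_tracker {
	struct enc_range ranges[MAX_RANGES];
	size_t count;
};

/*
 * Record [start_pfn, end_pfn) with the given attribute.
 * Returns 0 on success, -1 if the range is empty, the table is full,
 * or the range overlaps a reservation with a different attribute.
 */
static int enc_reserve(struct enc_tracker *t, uint64_t start_pfn,
		       uint64_t end_pfn, enum enc_attr attr)
{
	size_t i;

	if (start_pfn >= end_pfn || t->count == MAX_RANGES)
		return -1;

	for (i = 0; i < t->count; i++) {
		const struct enc_range *r = &t->ranges[i];
		bool overlap = start_pfn < r->end_pfn && r->start_pfn < end_pfn;

		if (overlap && r->attr != attr)
			return -1;
	}

	t->ranges[t->count++] = (struct enc_range){ start_pfn, end_pfn, attr };
	return 0;
}
```

In this model, reserving pfns 0..255 as private and then asking for pfn 0
as shared fails, which is the conflict from the scenario above.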


Otherwise, kernel_lockdown also disables useful mechanisms like debugfs,
and feels like it does not solve the underlying problem. Not all
ioremap() callers in the kernel are aware of a potential
ioremap_encrypted() dependency.

4. Eliminate the access to 0x472 from the x86 reboot path; after all, we don't
really have a proper BIOS at that address.
5. Something else?