On Wed, Aug 25, 2021 at 12:23 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
On 25.08.21 02:58, Dan Williams wrote:
On Mon, Aug 16, 2021 at 7:25 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
virtio-mem dynamically exposes memory inside a device memory region as
system RAM to Linux, coordinating with the hypervisor which parts are
actually "plugged" and consequently usable/accessible. On the one hand, the
virtio-mem driver adds/removes whole memory blocks, creating/removing busy
IORESOURCE_SYSTEM_RAM resources, on the other hand, it logically (un)plugs
memory inside added memory blocks, dynamically either exposing them to
the buddy or hiding them from the buddy and marking them PG_offline.
virtio-mem wants to make sure that in a sane environment, nobody
"accidentially" accesses unplugged memory inside the device managed
region. After /proc/kcore has been sanitized and /dev/kmem has been
removed, /dev/mem is the remaining interface that still allows uncontrolled
access to the device-managed region of virtio-mem devices from user
There is no known sane use case for mapping virtio-mem device memory
via /dev/mem while virtio-mem driver concurrently (un)plugs memory inside
that region. So once the driver was loaded and detected the device
along the device-managed region, we just want to disallow any access via
/dev/mem to it.
Let's add the basic infrastructure to exclude some physical memory
regions completely from /dev/mem access, on any architecture and under
any system configuration (independent of CONFIG_STRICT_DEVMEM and
independent of "iomem=").
I'm certainly on team "/dev/mem considered harmful", but this approach
feels awkward. It feels wrong for being non-committal about whether
CONFIG_STRICT_DEVMEM is in wide enough use that the safety can be
turned on all the time, and the configuration option dropped, or there
are users clinging onto /dev/mem where they expect to be able to build
a debug kernel to turn all of these restrictions off, even the
virtio-mem ones. This splits the difference and says some /dev/mem
accesses are always disallowed for "reasons", but I could say the same
thing about pmem, there's no sane reason to allow /dev/mem which has
no idea about the responsibilities of properly touching pmem to get
access to it.
For virtio-mem, there is no use case *and* access could be harmful; I
don't even want to allow if for debugging purposes. If you want to
inspect virtio-mem device memory content, use /proc/kcore, which
performs proper synchronized access checks. Modifying random virtio-mem
memory via /dev/mem in a debug kernel will not be possible: if you
really have to play with fire, use kdb or better don't load the
virtio-mem driver during boot, such that the kernel won't even be making
use of device memory.
I don't want people disabling CONFIG_STRICT_DEVMEM, or booting with
"iomem=relaxed", and "accidentally" accessing any of virtio-mem memory
via /dev/mem, while it gets concurrently plugged/unplugged by the
virtio-mem driver. Not even for debugging purposes.
That sounds more an argument that all of the existing "kernel is using
this region" cases should become mandatory exclusions. If unloading
the driver removes the exclusion then that's precisely
CONFIG_IO_STRICT_DEVMEM. Why is the virtio-mem driver more special
than any other driver that expects this integrity guarantee?
We disallow mapping to some other regions independent of
CONFIG_STRICT_DEVMEM already, so the idea to ignore CONFIG_STRICT_DEVMEM
is not completely new:
"Note that with PAT support enabled, even in this case there are
restrictions on /dev/mem use due to the cache aliasing requirements."
Maybe you even want to do something similar with PMEM now that there is
infrastructure for it and just avoid having to deal with revoking
/dev/mem mappings later.
That would be like blocking writes to /dev/sda just because a
filesytem might later be mounted on it. If the /dev/mem access is not
actively colliding with other kernel operations what business does the
kernel have saying no?
I'm pushing on this topic because I am also considering an exclusion
on PCI configuration access to the "DOE mailbox" since it can disrupt
the kernel's operation, at the same time, root can go change PCI BARs
to nonsensical values whenever it wants which is also in the category
of "has no use case && could be harmful".
I think there are weird debugging/educational setups  that still
require CONFIG_STRICT_DEVMEM=n even with iomem=relaxed. Take a look at
lib/devmem_is_allowed.c:devmem_is_allowed(), it disallows any access to
(what's currently added as) System RAM. It might just do what people
want when dealing with system RAM that doesn't suddenly vanish , so I
don't ultimately see why we should remove CONFIG_STRICT_DEVMEM=n.
Yes, I wanted to tease out more of your rationale on where the line
should be drawn, I think a mostly unfettered /dev/mem mode is here to