Re: [PATCH RFCv2 0/4] virtio-mem: paravirtualized memory
From: David Hildenbrand
Date: Wed May 23 2018 - 13:34:38 EST
On 23.05.2018 20:24, David Hildenbrand wrote:
> This is the Linux driver side of virtio-mem. Compared to the QEMU side,
> it is in a pretty complete and clean state.
>
> virtio-mem is a paravirtualized mechanism of adding/removing memory to/from
> a VM. We can do this on a 4MB granularity right now. In Linux, all
> memory is added to the ZONE_NORMAL, so unplugging cannot be guaranteed -
> but will be more likely to succeed compared to unplugging 128MB+ chunks.
> We might implement some optimizations in that area in the future that will
> make memory unplug more reliable.
>
> For now, this is an easy way to give a VM access to more memory and
> eventually to remove some memory again. I am testing it on x86 and
> s390x (under QEMU TCG so far only).
>
> This is the follow-up to [1], but the concept, user interface and
> virtio protocol have been heavily changed. I am only including the important
> parts in this cover letter (because otherwise nobody will read it). Please
> feel free to ask in case there are any questions.
>
> This series is based on [4] and shows how it is being used. It contains
> further information. Also have a look at the description of patch nr 4 in
> this series.
>
> This work is the result of the initial idea of Andrea Arcangeli to
> host-enforce guest access to memory inflated in virtio-balloon using
> userfaultfd, which turned out to be problematic to implement. That's how
> I came up with virtio-mem.
>
> --------------------------------------------------------------------------
> 1. High level concept
> --------------------------------------------------------------------------
>
> Each virtio-mem device owns a memory region in the physical address space.
> The guest is allowed to plug and online up to 'requested_size' of memory.
> It will not be allowed to plug more than that size. Unplugged memory will
> be protected by configurable mechanisms (e.g. random discard, userfaultfd
> protection, etc.). virtio-mem is designed in a way that a guest must never
> assume that it can even read unplugged memory. This is a big difference
> from classical balloon drivers.
>
> The usable memory region might grow over time, so not all parts of the
> device memory region might be usable from the start. This is an
> optimization to allow a smarter implementation in the hypervisor (reduce
> size of dirty bitmaps, size of memory regions ...).
>
> When the device driver starts up, it will query 'requested_size' and start
> to add memory to the system. This memory is not indicated e.g. via ACPI,
> so unmodified systems will not silently try to use unplugged memory that
> they are not supposed to touch.
>
> Updates on the 'requested_size' indicate hypervisor requests to plug or
> unplug memory.
>
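To make the resize rule above a bit more concrete, here is a minimal sketch (plain Python, not the driver code) of how a 'requested_size' update could be turned into a number of 4MB blocks to plug or unplug. The helper name and the assumption that both sizes are multiples of the block size are made up for illustration only.

```python
BLOCK_SIZE = 4 * 1024 * 1024  # 4MB granularity mentioned in the cover letter


def blocks_to_change(requested_size, plugged_size, block_size=BLOCK_SIZE):
    """Return (action, nr_blocks) to move plugged_size toward requested_size.

    Hypothetical helper: assumes both sizes are multiples of block_size.
    """
    if requested_size % block_size or plugged_size % block_size:
        raise ValueError("sizes must be multiples of the block size")
    diff = requested_size - plugged_size
    if diff > 0:
        # The hypervisor asks for more memory than is currently plugged.
        return ("plug", diff // block_size)
    if diff < 0:
        # The hypervisor asks for less; unplug may only partially succeed.
        return ("unplug", -diff // block_size)
    return ("none", 0)


if __name__ == "__main__":
    print(blocks_to_change(8 * BLOCK_SIZE, 2 * BLOCK_SIZE))
```

Note that "unplug" is only a request here: as described above, unplugging is not guaranteed to succeed, so the plugged size may stay above 'requested_size' for a while.
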
> As each virtio-mem device can belong to a NUMA node, we can easily
> plug/unplug memory on a NUMA basis. And of course, we can have several
> independent virtio-mem devices for a VM.
>
> The idea is *not* to add new virtio-mem devices when hotplugging memory,
> the idea is to resize (grow/shrink) virtio-mem devices.
>
> --------------------------------------------------------------------------
> 2. Benefits
> --------------------------------------------------------------------------
>
> Guest side:
> - Increase memory usable by Linux in 4MB steps (vs. section size like 128MB
> on x86 or 2GB on e.g. some arm if I'm not mistaken)
> - Remove struct pages once all 4MB chunks of a section are offline (in
> contrast to all balloon drivers where this never happens)
> - Don't fragment memory, while still being able to unplug smaller chunks
> than ordinary DIMM sizes.
> - Memory hotplug support for architectures that have no proper interface
> (e.g. s390x misses the external notification part) or where e.g.
> QEMU/Linux support is complicated to implement.
> - Automatic management of onlining/offlining in the device driver -
> no manual interaction from an admin/tool necessary.
>
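A rough illustration of the "remove struct pages" point, assuming the x86 numbers from the list above (128MB sections, 4MB chunks). The per-section list of booleans is a hypothetical bookkeeping structure for this sketch, not how the driver tracks state.

```python
SECTION_SIZE = 128 * 1024 * 1024  # x86 section size named above
CHUNK_SIZE = 4 * 1024 * 1024      # virtio-mem granularity named above
CHUNKS_PER_SECTION = SECTION_SIZE // CHUNK_SIZE


def section_removable(chunk_online):
    """Return True once every 4MB chunk of one section is offline.

    chunk_online: hypothetical list with one boolean per chunk.
    """
    if len(chunk_online) != CHUNKS_PER_SECTION:
        raise ValueError("expected one entry per chunk of the section")
    return not any(chunk_online)


if __name__ == "__main__":
    state = [False] * CHUNKS_PER_SECTION
    state[5] = True  # a single online chunk keeps the whole section alive
    print(CHUNKS_PER_SECTION, section_removable(state))
```

With these sizes a section consists of 32 chunks, and only when the last of them goes offline can the section (and its struct pages) be removed.
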
> QEMU side:
> - Resizing (plug/unplug) has a single interface - in contrast to a mixture
> of ACPI and virtio-balloon. See the example below.
> - Migration works out of the box - no need to specify new DIMMs or new
> sizes on the migration target. It simply works.
> - We can resize in arbitrary steps and sizes (in contrast to e.g. ACPI,
> where we have to know upfront in which granularity we later on want to
> remove memory or even how much memory we eventually want to add to our
> guest)
> - One interface to rule them (architectures) all :)
>
> --------------------------------------------------------------------------
> 3. Reboot handling
> --------------------------------------------------------------------------
>
> After a reboot, all memory is unplugged. This allows the hypervisor
> to see if support for virtio-mem is available in the freshly booted system.
> This way we could charge only for the actually "plugged" memory size. And
> it avoids having to sense for plugged memory in the guest.
>
> E.g. on every size change of a virtio-mem device, we can notify management
> layers. So we can track how much memory a VM has plugged.
>
> --------------------------------------------------------------------------
> 4. Example
> --------------------------------------------------------------------------
>
> (not including resizable memory regions on the QEMU side yet, so don't
> focus on that part - it will consume a lot of memory right now for e.g.
> dirty bitmaps and memory slot tracking data)
>
> Start QEMU with two virtio-mem devices that provide little memory initially.
> $ qemu-system-x86_64 -m 4G,maxmem=504G \
> -smp sockets=2,cores=2 \
> [...]
> -object memory-backend-ram,id=mem0,size=256G \
> -device virtio-mem-pci,id=vm0,memdev=mem0,node=0,size=4160M \
> -object memory-backend-ram,id=mem1,size=256G \
> -device virtio-mem-pci,id=vm1,memdev=mem1,node=1,size=3G
>
> Query the configuration ('size' tells us the guest driver is active):
> (qemu) info memory-devices
> info memory-devices
> Memory device [virtio-mem]: "vm0"
> phys-addr: 0x140000000
> node: 0
> requested-size: 4362076160
> size: 4362076160
> max-size: 274877906944
> block-size: 4194304
> memdev: /objects/mem0
> Memory device [virtio-mem]: "vm1"
> phys-addr: 0x4140000000
> node: 1
> requested-size: 3221225472
> size: 3221225472
> max-size: 274877906944
> block-size: 4194304
> memdev: /objects/mem1
>
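For anyone wondering how the byte values in this output relate to the command line, here is a small sketch of the arithmetic. It assumes the M/G suffixes mean MiB/GiB, which is consistent with the numbers printed above; the helper name is made up.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3
BLOCK_SIZE = 4 * MiB  # "block-size: 4194304" in the monitor output


def to_blocks(size_bytes, block_size=BLOCK_SIZE):
    """Return the number of whole blocks; raise if the size is not aligned."""
    blocks, rest = divmod(size_bytes, block_size)
    if rest:
        raise ValueError("size is not a multiple of the block size")
    return blocks


vm0_size = 4160 * MiB  # size=4160M on the command line
vm1_size = 3 * GiB     # size=3G on the command line
max_size = 256 * GiB   # size=256G of each memory backend

# Distance between the phys-addr values of vm1 and vm0, in GiB.
addr_gap_gib = (0x4140000000 - 0x140000000) // GiB

if __name__ == "__main__":
    for name, size in (("vm0", vm0_size), ("vm1", vm1_size), ("max", max_size)):
        print(name, size, to_blocks(size))
    print("gap between devices (GiB):", addr_gap_gib)
```

So vm0 starts out with 1040 blocks and vm1 with 768 blocks, and in this example the two device regions sit exactly one max-size (256G) apart in the physical address space.
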
> Change the size of a virtio-mem device:
> (qemu) memory-device-resize vm0 40960
> memory-device-resize vm0 40960
> ...
> (qemu) info memory-devices
> info memory-devices
> Memory device [virtio-mem]: "vm0"
> phys-addr: 0x140000000
> node: 0
> requested-size: 42949672960
> size: 42949672960
> max-size: 274877906944
> block-size: 4194304
> memdev: /objects/mem0
> ...
>
> Try to unplug memory (KASAN active in the guest - a lot of memory wasted):
> (qemu) memory-device-resize vm0 1024
> memory-device-resize vm0 1024
> ...
> (qemu) info memory-devices
> info memory-devices
> Memory device [virtio-mem]: "vm0"
> phys-addr: 0x140000000
> node: 0
> requested-size: 1073741824
> size: 6169821184
> max-size: 274877906944
> block-size: 4194304
> memdev: /objects/mem0
> ...
>
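To read the last output: 'size' staying above 'requested-size' means the unplug request was only partially satisfied. A quick sketch of the numbers, using the values printed above (the helper is hypothetical and only does the arithmetic):

```python
BLOCK_SIZE = 4194304  # "block-size" from the monitor output


def unplug_progress(size_before, size_now, requested_size,
                    block_size=BLOCK_SIZE):
    """Return how many blocks were unplugged and how many are still pending."""
    return {
        "unplugged_blocks": (size_before - size_now) // block_size,
        "pending_blocks": max(size_now - requested_size, 0) // block_size,
    }


# size after the resize to 40960M, size now, and the new requested-size
progress = unplug_progress(42949672960, 6169821184, 1073741824)

if __name__ == "__main__":
    print(progress)
```

In this run 8769 blocks were given back, while 1215 blocks (4860MB) could not be unplugged yet.
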
> For now, I am only sharing the Linux driver side. The current code can be
> found at [2]. The QEMU side is still heavily WIP; the current QEMU
> prototype can be found at [3].
>
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg03870.html
> [2] https://github.com/davidhildenbrand/linux/tree/virtio-mem
> [3] https://github.com/davidhildenbrand/qemu/tree/virtio-mem
> [4] https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1698014.html
>
> David Hildenbrand (4):
> ACPI: NUMA: export pxm_to_node
> s390: mm: support removal of memory
> s390: numa: implement memory_add_physaddr_to_nid()
> virtio-mem: paravirtualized memory
>
> arch/s390/mm/init.c | 18 +-
> arch/s390/numa/numa.c | 12 +
> drivers/acpi/numa.c | 1 +
> drivers/virtio/Kconfig | 15 +
> drivers/virtio/Makefile | 1 +
> drivers/virtio/virtio_mem.c | 1040 +++++++++++++++++++++++++++++++
> include/uapi/linux/virtio_ids.h | 1 +
> include/uapi/linux/virtio_mem.h | 134 ++++
> 8 files changed, 1216 insertions(+), 6 deletions(-)
> create mode 100644 drivers/virtio/virtio_mem.c
> create mode 100644 include/uapi/linux/virtio_mem.h
>
cc-ing some further mailing lists
--
Thanks,
David / dhildenb