Re: maybe revert commit c275a57f5ec3 "xen/balloon: Set balloon's initial state to number of existing RAM pages"
From: Konrad Rzeszutek Wilk
Date: Fri Mar 24 2017 - 17:11:43 EST
On Fri, Mar 24, 2017 at 04:34:23PM -0400, Dan Streetman wrote:
> On Wed, Mar 22, 2017 at 10:13 PM, Boris Ostrovsky
> <boris.ostrovsky@xxxxxxxxxx> wrote:
> >
> >
> > On 03/22/2017 05:16 PM, Dan Streetman wrote:
> >>
> >> I have a question about a problem introduced by this commit:
> >> c275a57f5ec3056f732843b11659d892235faff7
> >> "xen/balloon: Set balloon's initial state to number of existing RAM pages"
> >>
> >> It changed the xen balloon current_pages calculation to start with the
> >> number of physical pages in the system, instead of max_pfn. Since
> >> get_num_physpages() does not include holes, it's always less than the
> >> e820 map's max_pfn.
> >>
> >> However, the problem that commit introduced is, if the hypervisor sets
> >> the balloon target to equal to the e820 map's max_pfn, then the
> >> balloon target will *always* be higher than the initial current pages.
> >> Even if the hypervisor sets the target to (e820 max_pfn - holes), if
> >> the OS adds any holes, the balloon target will be higher than the
> >> current pages. This is the situation, for example, for Amazon AWS
> >> instances. The result is, the xen balloon will always immediately
> >> hotplug some memory at boot, but then make only (max_pfn -
> >> get_num_physpages()) available to the system.
> >>
> >> This balloon-hotplugged memory can cause problems, if the hypervisor
> >> wasn't expecting it; specifically, the system's physical page
> >> addresses now will exceed the e820 map's max_pfn, due to the
> >> balloon-hotplugged pages; if the hypervisor isn't expecting pt-device
> >> DMA to/from those physical pages above the e820 max_pfn, it causes
> >> problems. For example:
> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129
> >>
> >> The additional small amount of balloon memory can cause other problems
> >> as well, for example:
> >> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457
> >>
> >> Anyway, I'd like to ask, was the original commit added because
> >> hypervisors are supposed to set their balloon target to the guest
> >> system's number of phys pages (max_pfn - holes)? The mailing list
> >> discussion and commit description seem to indicate that.
> >
> >
> >
> > IIRC the problem that this was trying to fix was that since max_pfn includes
> > holes, upon booting we'd immediately balloon down by the (typically, MMIO)
> > hole size.
> >
> > If you boot a guest with ~4+GB memory you should see this.
> >
> >
> >> However I'm
> >> not sure how that is possible, because the kernel reserves its own
> >> holes, regardless of any predefined holes in the e820 map; for
> >> example, the kernel reserves 64k (by default) at phys addr 0 (the
> >> amount of reservation is configurable via CONFIG_X86_RESERVE_LOW). So
> >> the hypervisor really has no way to know what the "right" target to
> >> specify is; unless it knows the exact guest OS and kernel version, and
> >> kernel config values, it will never be able to correctly specify its
> >> target to be exactly (e820 max_pfn - all holes).
> >>
> >> Should this commit be reverted? Should the xen balloon target be
> >> adjusted based on kernel-added e820 holes?
> >
> >
> > I think the second one but shouldn't current_pages be updated, and not the
> > target? The latter is set by Xen (toolstack, via xenstore usually).
> >
> > Also, the bugs above (at least one of them) talk about NVMe and I wonder
> > whether the memory that they add is of RAM type --- I believe it has its own
> > type and so perhaps that introduces additional inconsistencies. AWS may have
> > added their own support for that, which we don't have upstream yet.
>
> The type of memory doesn't have anything to do with it.
>
> The problem with NVMe is it's a passthrough device, so the guest talks
> directly to the NVMe controller and does DMA with it. But the
> hypervisor does swiotlb translation between the guest physical memory,
Um, the hypervisor does not have SWIOTLB support, only IOMMU support.
> and the host physical memory, so that the NVMe device can correctly
> DMA to the right memory in the host.
>
> However, the hypervisor only has the guest's physical memory up to the
> max e820 pfn mapped; it didn't expect the balloon driver to hotplug
> any additional memory above the e820 max pfn, so when the NVMe driver
> in the guest tries to tell the NVMe controller to DMA to that
> balloon-hotplugged memory, the hypervisor fails the NVMe request,
But when the memory hotplug happens the hypercalls are done to
raise the max pfn.
> because it can't do the guest-to-host phys mem mapping, since the
> guest phys address is outside the expected max range.
>
>
>
> >
> > -boris
> >
> >
> >
> >> Should something else be
> >> done?
> >>
> >> For context, Amazon Linux has simply disabled Xen ballooning
> >> completely. Likewise, we're planning to disable Xen ballooning in the
> >> Ubuntu kernel for Amazon AWS-specific kernels (but not for non-AWS
> >> Ubuntu kernels). However, if reverting this patch makes sense in a
> >> bigger context (i.e. Xen users besides AWS), that would allow more
> >> Ubuntu kernels to work correctly in AWS instances.
> >>
> >