Re: [PATCH v3 0/6] mm / virtio: Provide support for unused page reporting

From: Alexander Duyck
Date: Fri Aug 02 2019 - 11:13:59 EST


On Fri, 2019-08-02 at 10:41 -0400, Nitesh Narayan Lal wrote:
> On 8/1/19 6:24 PM, Alexander Duyck wrote:
> > This series provides an asynchronous means of reporting to a hypervisor
> > that a guest page is no longer in use and can have the data associated
> > with it dropped. To do this, I have implemented functionality that allows
> > for what I am referring to as unused page reporting.
> >
> > The functionality for this is fairly simple. When enabled, it will
> > allocate statistics to track the number of reported pages in a given
> > free area. When the number of free pages exceeds this value plus a
> > high-water value, currently 32, it will begin performing page
> > reporting, which consists of pulling pages off of the free list and
> > placing them into a scatterlist. The scatterlist is then given to the
> > page reporting device, which performs the required action to make the
> > pages "reported"; in the case of virtio-balloon this results in the
> > pages being madvised with MADV_DONTNEED, and as such they are forced
> > out of the guest. After this they are placed back on the free list
> > and, if they are not merged, an additional bit is set to indicate that
> > they are a reported buddy page instead of a standard buddy page. The
> > cycle then repeats, with additional non-reported pages being pulled
> > until the free areas all consist of reported pages.
> >
> > I am leaving a number of things hard-coded, such as limiting the lowest
> > order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
> > determine the limit on how many pages it wants to allocate for
> > processing the hints. The upper limit for this is based on the size of
> > the queue used to store the scatterlist.
> >
> > My primary testing has just been to verify the memory is being freed
> > after allocation by running memhog 40g on a 40g guest and watching the
> > total free memory via /proc/meminfo on the host. With this I have
> > verified that most of the memory is freed after each iteration. As far
> > as performance goes, I have been mainly focusing on the
> > will-it-scale/page_fault1 test running with 16 vcpus. With that I have
> > seen up to a 2% difference between the base kernel without these
> > patches and the patched kernel with virtio-balloon either enabled or
> > disabled.
>
> A couple of questions:
>
> - The 2% difference you mentioned: is it visible for all 16 cores or
> just the 16th core?
> - I am assuming that the difference is seen for both the "number of
> processes" and "number of threads" runs of page_fault1. Is that right?

Really, the 2% is bordering on just being noise. Sometimes it is better,
sometimes it is worse. However, I think it is just slight variability in
the tests, since it doesn't usually form any specific pattern.

I have been able to tighten it down a bit by splitting my guest over 2
nodes and pinning the vCPUs so that the nodes in the guest match up to the
nodes in the host. Doing that, I have seen results with less than 1%
variability between runs with the patches and without.

One thing I am looking at now is modifying the page_fault1 test to use THP
instead of 4K pages, as I suspect there is a fair bit of overhead in
accessing the pages 4K at a time versus 2M at a time. With that I am hoping
to put more pressure on the actual change and see if there are any
additional spots I should optimize.
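
Roughly the sort of loop I have in mind, just as an illustrative sketch
(the 128MB mapping size and the one-write-per-2M pattern are arbitrary,
and this is not the actual will-it-scale code):

#include <sys/mman.h>

#define MAP_SIZE	(128UL << 20)	/* arbitrary 128MB region per iteration */
#define THP_SIZE	(2UL << 20)	/* 2M, the x86 THP size */

int main(void)
{
	char *buf;
	unsigned long off;

	buf = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Ask the kernel to back the region with THP where it can. */
	madvise(buf, MAP_SIZE, MADV_HUGEPAGE);

	/*
	 * One write per 2M region; assuming the range is suitably aligned
	 * and THP allocation succeeds, each write faults in a 2M page
	 * instead of a 4K page.
	 */
	for (off = 0; off < MAP_SIZE; off += THP_SIZE)
		buf[off] = 1;

	munmap(buf, MAP_SIZE);
	return 0;
}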

> > One side effect of these patches is that the guest becomes much more
> > resilient in terms of NUMA locality. With the pages being freed and then
> > reallocated when used, they end up much closer to the active thread, and
> > as a result there can be situations where this patch set will outperform
> > the stock kernel when the guest memory is not local to the guest vCPUs.
>
> Was this the reason you were seeing better results for page_fault1
> earlier?

Yes, I think so. What I have found is that in the case where the patches
are not applied on the guest, it takes a few runs for the numbers to
stabilize. What I think was going on is that I was running memhog to
initially fill the guest, and that was placing all the pages on one node or
the other, which caused additional variability as the pages were slowly
migrated over to the other node to rebalance the workload. One way I tested
this was by trying the unpatched case with a direct-assigned device, since
that forces the memory to be pinned. In that case I was consistently
getting bad results, as all the memory was forced to come from one node
during the pre-allocation process.
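
As an aside, the "reporting" step described above boils down on the host
side to an madvise(MADV_DONTNEED) over the range backing the reported
pages, which is why the memory shows up as free in the host's
/proc/meminfo. A minimal sketch of that idea (hypothetical helper name,
not the actual QEMU/virtio-balloon code):

#include <sys/mman.h>

/*
 * Hypothetical illustration only: drop the host backing for a chunk of
 * guest memory that the guest reported as unused. A later guest access
 * to the range simply faults in fresh zeroed pages.
 */
int discard_reported_range(void *guest_hva, size_t len)
{
	return madvise(guest_hva, len, MADV_DONTNEED);
}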