Re: [PATCH v3 0/6] mm / virtio: Provide support for unused page reporting

From: Nitesh Narayan Lal
Date: Fri Aug 02 2019 - 12:19:47 EST

On 8/2/19 11:13 AM, Alexander Duyck wrote:
> On Fri, 2019-08-02 at 10:41 -0400, Nitesh Narayan Lal wrote:
>> On 8/1/19 6:24 PM, Alexander Duyck wrote:
>>> This series provides an asynchronous means of reporting to a hypervisor
>>> that a guest page is no longer in use and can have the data associated
>>> with it dropped. To do this I have implemented functionality that allows
>>> for what I am referring to as unused page reporting.
>>>
>>> The functionality for this is fairly simple. When enabled it will allocate
>>> statistics to track the number of reported pages in a given free area.
>>> When the number of free pages exceeds this value plus a high water value,
>>> currently 32, it will begin performing page reporting which consists of
>>> pulling pages off of the free list and placing them into a scatterlist. The
>>> scatterlist is then given to the page reporting device and it will perform
>>> the required action to make the pages "reported". In the case of
>>> virtio-balloon this results in the pages being madvised as MADV_DONTNEED
>>> and as such they are forced out of the guest. After this they are placed
>>> back on the free list, and an additional bit is added if they are not
>>> merged indicating that they are a reported buddy page instead of a
>>> standard buddy page. The cycle then repeats with additional non-reported
>>> pages being pulled until the free areas all consist of reported pages.
>>>
>>> I am leaving a number of things hard-coded such as limiting the lowest
>>> order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
>>> determine what the limit is on how many pages it wants to allocate to
>>> process the hints. The upper limit for this is based on the size of the
>>> queue used to store the scatterlist.
>>>
>>> My primary testing has just been to verify the memory is being freed after
>>> allocation by running memhog 40g on a 40g guest and watching the total
>>> free memory via /proc/meminfo on the host. With this I have verified most
>>> of the memory is freed after each iteration. As far as performance goes, I have
>>> been mainly focusing on the will-it-scale/page_fault1 test running with
>>> 16 vcpus. With that I have seen up to a 2% difference between the base
>>> kernel without these patches and the patches with virtio-balloon enabled
>>> or disabled.
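
For readers following the thread, the trigger condition and reporting cycle
described in the cover letter above can be modeled roughly as below. This is
a toy sketch in Python, not the kernel code from the series: the names
FreeArea, HIGH_WATER, should_report, report_batch and run_reporting are
hypothetical, and only the high water value of 32 comes from the description.
The batch capacity stands in for the scatterlist size, which the cover letter
does not give.

```python
from dataclasses import dataclass

HIGH_WATER = 32  # high water value quoted in the cover letter


@dataclass
class FreeArea:
    """Toy model of one free area (hypothetical, not a kernel structure)."""
    nr_free: int          # pages currently on the free list
    nr_reported: int = 0  # subset of those pages already reported


def should_report(area: FreeArea) -> bool:
    """Reporting begins once free pages exceed reported pages plus the high water value."""
    return area.nr_free > area.nr_reported + HIGH_WATER


def report_batch(area: FreeArea, capacity: int) -> int:
    """Pull up to `capacity` non-reported pages, report them, and put them back
    on the free list marked as reported. Returns the batch size."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    batch = min(capacity, area.nr_free - area.nr_reported)
    area.nr_reported += batch
    return batch


def run_reporting(area: FreeArea, capacity: int) -> int:
    """Once triggered, repeat the cycle until the area holds only reported pages.
    Returns the number of cycles performed."""
    if not should_report(area):
        return 0
    cycles = 0
    while area.nr_reported < area.nr_free:
        report_batch(area, capacity)
        cycles += 1
    return cycles
```

In this model an area with exactly 32 unreported free pages is left alone,
and one with 33 or more is drained in batches until everything is reported.
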
>> A couple of questions:
>> - The 2% difference which you have mentioned, is this visible for
>> all 16 cores or just the 16th core?
>> - I am assuming that the difference is seen for both the "number of processes"
>> and "number of threads" launched by page_fault1. Is that right?
> Really, the 2% is bordering on just being noise. Sometimes it is better,
> sometimes it is worse. However, I think it is just slight variability in
> the tests since it doesn't usually form any specific pattern.
>
> I have been able to tighten it down a bit by actually splitting my guest
> over 2 nodes and pinning the vCPUs so that the nodes in the guest match up
> to the nodes in the host. Doing that I have seen results where I had less
> than 1% variability between runs with the patches and without.

Interesting. I usually pin the guest to a single NUMA node to avoid this.

> One thing I am looking at now is modifying the page_fault1 test to use THP
> instead of 4K pages as I suspect there is a fair bit of overhead in
> accessing the pages 4K at a time vs 2M at a time. I am hoping with that I
> can put more pressure on the actual change and see if there are any
> additional spots I should optimize.
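
As a rough illustration of why THP should change the picture, the fault
counts can be compared directly. This is only a back-of-the-envelope sketch:
it assumes exactly one fault per page touched, and the 1 GiB region is an
arbitrary example size, not a number from this thread. The function name
faults_to_touch is hypothetical.

```python
PAGE_4K = 4 * 1024         # base page size discussed above
PAGE_2M = 2 * 1024 * 1024  # THP size discussed above


def faults_to_touch(nbytes: int, page_size: int) -> int:
    """Minimum number of page faults needed to touch `nbytes` once,
    assuming one fault per page (ceiling division)."""
    if nbytes < 0 or page_size <= 0:
        raise ValueError("nbytes must be >= 0 and page_size must be > 0")
    return -(-nbytes // page_size)


region = 1 << 30  # 1 GiB, arbitrary example size
small = faults_to_touch(region, PAGE_4K)  # 262144 faults
huge = faults_to_touch(region, PAGE_2M)   # 512 faults
print(small, huge, small // huge)         # ratio is 512
```

With far fewer faults per byte touched, the fixed per-fault cost should
shrink relative to whatever the reporting path adds.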

+1. Right now I don't think will-it-scale touches all the guest memory.
May I know how much memory will-it-scale/page_fault1 occupies in your case,
and how much you get back with your patch set?

Do you plan to run any other benchmarks as well, just to see the impact on
other subsystems?

>>> One side effect of these patches is that the guest becomes much more
>>> resilient in terms of NUMA locality. With the pages being freed and then
>>> reallocated when used it allows for the pages to be much closer to the
>>> active thread, and as a result there can be situations where this patch
>>> set will out-perform the stock kernel when the guest memory is not local
>>> to the guest vCPUs.
>> Was this the reason you were seeing better results for page_fault1
>> earlier?
> Yes, I think so. What I have found is that in the case where the
> patches are not applied on the guest it takes a few runs for the numbers
> to stabilize. What I think was going on is that I was running memhog to
> initially fill the guest and that was placing all the pages on one node or
> the other and as such was causing additional variability as the pages were
> slowly being migrated over to the other node to rebalance the workload.
>
> One way I tested it was by trying the unpatched case with a direct-
> assigned device since that forces it to pin the memory. In that case I was
> getting bad results consistently as all the memory was forced to come from
> one node during the pre-allocation process.

I have also seen that the page_fault1 values take some time to stabilize on
an unmodified kernel.
What I am wondering is whether, on a single-NUMA-node guest, doing the
following would give a better picture:

1. Pin the guest to a single NUMA node.
2. Run memhog so that it touches all the guest memory.
3. Run will-it-scale/page_fault1.

Then compare the values for the last core (assuming the values for the other
cores do not differ drastically).
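
The last-core comparison could be scripted along these lines. This is a
minimal sketch, assuming the per-core-count results from each run have
already been collected into lists (index i holding the value for i + 1
tasks). The helper names and the sample numbers are placeholders for
illustration only, not measurements.

```python
def percent_diff(baseline: float, patched: float) -> float:
    """Relative difference of patched vs. baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (patched - baseline) * 100.0 / baseline


def last_core_diff(baseline_runs, patched_runs) -> float:
    """Compare only the final entry, i.e. the run using all cores."""
    if not baseline_runs or len(baseline_runs) != len(patched_runs):
        raise ValueError("need two non-empty result lists of equal length")
    return percent_diff(baseline_runs[-1], patched_runs[-1])


# Placeholder values, made up purely to show the call.
base = [100.0, 200.0, 400.0]
patched = [101.0, 198.0, 396.0]
print(f"last core: {last_core_diff(base, patched):+.1f}%")
```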