Re: [PATCH v11 0/6] mm / virtio: Provide support for unused page reporting

From: David Hildenbrand
Date: Thu Oct 10 2019 - 03:39:41 EST

On 09.10.19 21:46, Nitesh Narayan Lal wrote:
> On 10/9/19 12:35 PM, Alexander Duyck wrote:
>> On Wed, 2019-10-09 at 11:21 -0400, Nitesh Narayan Lal wrote:
>>> On 10/7/19 1:06 PM, Nitesh Narayan Lal wrote:
>>> [...]
>>>>> So what was the size of your guest? One thing that just occurred to me is
>>>>> that you might be running a much smaller guest than I was.
>>>> I am running a 30 GB guest.
>>>>>>> If so I would have expected a much higher difference versus
>>>>>>> baseline as zeroing/faulting the pages in the host gets expensive fairly
>>>>>>> quick. What is the host kernel you are running your test on? I'm just
>>>>>>> wondering if there is some additional overhead currently limiting your
>>>>>>> setup. My host kernel was just the same kernel I was running in the guest,
>>>>>>> just built without the patches applied.
>>>>>> Right now I have a different host-kernel. I can install the same kernel to the
>>>>>> host as well and see if that changes anything.
>>>>> The host kernel will have a fairly significant impact as I recall. For
>>>>> example running a stock CentOS kernel lowered the performance compared to
>>>>> running a linux-next kernel. As a result the numbers looked better since
>>>>> the overall baseline was lower to begin with as the host OS was
>>>>> introducing additional overhead.
>>>> I see. In that case I will try installing the same guest kernel
>>>> to the host as well.
>>> As per your suggestion, I tried replacing the host kernel with an
>>> upstream kernel without my patches i.e., my host has a kernel built on top
>>> of the upstream kernel's master branch which has Sept 23rd commit and the guest
>>> has the same kernel for the no-hinting case and same kernel + my patches
>>> for the page reporting case.
>>> With the changes reported earlier on top of v12, I am not seeing any further
>>> degradation (other than what I have previously reported).
>>> To be sure that THP is actively used, I did an experiment where I changed the
>>> MEMSIZE in the page_fault. On doing so THP usage checked via /proc/meminfo also
>>> increased as I expected.
>>> In any case, if you find something else please let me know and I will look into it
>>> again.
>>> I am still looking into your suggestion about cache line bouncing and will reply
>>> to it, if I have more questions.
>>> [...]
>> I really feel like this discussion has gone off course. The idea here is
>> to review this patch set[1] and provide working alternatives if there are
>> issues with the current approach.
> Agreed.
>> The bitmap based approach still has a number of outstanding issues
>> including sparse memory and hotplug which have yet to be addressed.
> True, but I don't think those two are a blocker.
> For sparse zone as we are maintaining the bitmap on a granularity of
> (MAX_ORDER - 2) / (MAX_ORDER - 1) etc. the memory wastage should be
> negligible in most of the cases.
> For memory hotplug/hotremove, I did make sure that I don't break anything,
> even if a user starts using this feature with page-reporting enabled.
> However, it is true that I don't report or capture any memory added/removed
> through it.
> Fixing these issues will be an optimization which I will do as I get my basic
> framework ready and in shape.
>> We can
>> gloss over that, but there is a good chance that resolving those would
>> have potential performance implications. With this most recent change
>> there is now also the fact that it can only really support reporting at
>> one page order so the solution is now much more prone to issues with
>> memory fragmentation than it was before. The fact that my
>> solution works with multiple page orders while the bitmap approach
>> requires MAX_ORDER - 1 seems like another obvious win for my solution.
> This is just a configuration change and only requires updating
> the macro 'PAGE_REPORTING_MIN_ORDER' to what you are using.
> Which order we want to report could vary based on the
> use case where we are deploying the solution.
> Ideally, this should be configurable, maybe at compile time,
> or we can stick with pageblock_order, which was originally suggested
> and used by you.
>> Until we can get back to the point where we are comparing apples to apples
>> I would prefer not to benchmark the bitmap solution as without the extra
>> order limitation it was over 20% worse than my solution performance-wise.
> Understood.
> However, as I reported previously after making the configuration changes
> on top of v12 posting, I don't see the degradation.
> I will be happy to try out more suggestions to see if the issue is really fixed.
> I have started looking into your concern of cacheline bouncing after
> which I will look into Michal's suggestion of using page-isolation APIs to
> isolate and release pages back. After that, I can decide on
> posting my next series (if it is required).
>> Ideally I would like to get code review for patches 3 and 4, and spend my
>> time addressing issues reported there. The main thing I need input on is
>> if the solution of allowing the list iterators to be reset is good enough
>> to address the compaction issues that were pointed out several releases
>> ago or if I have to look for another solution. Also I have changed things
>> so that page_reporting.h was split over two files with the new one now
>> living in the mm/ folder. By doing that I was hoping to reduce the
>> exposure of the internal state of the free-lists so that essentially all
>> we end up providing is an interface for the notifier to be used by virtio-
>> balloon.
> If everyone agrees that what you are proposing is the best way to move
> forward then, by all means, let's go ahead with it. :)

Sorry, I didn't get to follow the discussion; I caught a cold and my body
is still fighting with the last resistance.

Is there any rough summary of how much faster Alexander's approach is
compared to some external tracking? For external tracking, there is a
lot of optimization potential as far as I can read; however, I think a
rough summary of "how far we are off" should be possible by now.

Also, are there benchmarks/setups where both perform the same?



David / dhildenb