Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting

From: David Hildenbrand
Date: Tue Feb 19 2019 - 09:21:13 EST


On 19.02.19 15:17, Nitesh Narayan Lal wrote:
> On 2/19/19 8:03 AM, David Hildenbrand wrote:
>>>>>> There are two main ways to avoid allocation:
>>>>>> 1. do not add extra data on top of each chunk passed
>>>>> If I am not wrong then this is close to what we have right now.
>>>> Yes, minus the kthread(s) and eventually with some sort of memory
>>>> allocation for the request. Once you're asynchronous via a notification
>>>> mechanisnm, there is no real need for a thread anymore, hopefully.
>>> Whether we should go with kthread or without it, I would like to do some
>>> performance comparison before commenting on this.
>>>>> One issue I see right now is that I am polling while host is freeing the
>>>>> memory.
>>>>> In the next version I could tie the logic which returns pages to the
>>>>> buddy and resets the per cpu array index value to 0 with the callback.
>>>>> (i.e.., it happens once we receive an response from the host)
>>>> The question is, what happens when freeing pages and the array is not
>>>> ready to be reused yet. In that case, you want to somehow continue
>>>> freeing pages without busy waiting or eventually not reporting pages.
>>> This is what happens right now.
>>> Having kthread or not should not effect this behavior.
>>> When the array is full the current approach simply skips collecting the
>>> free pages.
>> Well, it somehow does affect your implementation. If you have a kthread
>> you always have to synchronize against the VCPU: "Is the pcpu array
>> ready to be used again".
>>
>> Once you do it asynchronously from your VCPU without another thread
>> being involved, such synchronization is not required. Simply prepare a
>> request and send it off. Reuse the pcpu array instantly. At least that's
>> the theory :)
>>
>> If you have a guest bulk freeing a lot of memory, I guess temporarily
>> dropping free page hints could be counter-productive. It really depends
>> on how fast the thread gets scheduled and how long the hinting process
>> takes. Having another thread involved might add a lot to that latency to
>> that formula. We'll have to measure, but my gut feeling is that once we
>> do stuff asynchronously, there is no need for a thread anymore.
> This is true.
>>
>>>> The callback should put the pages back to the buddy and free the request
>>>> eventually to have a fully asynchronous mechanism.
>>>>
>>>>> Other change which I am testing right now is to only capture 'MAX_ORDER
>>>> I am not sure if this is an arbitrary number we came up with here. We
>>>> should really play with different orders to find a hot spot. I wouldn't
>>>> consider this high priority, though. Getting the whole concept right to
>>>> be able to deal with any magic number we come up should be the ultimate
>>>> goal. (stuff that only works with huge pages I consider not future
>>>> proof, especially regarding fragmented guests which can happen easily)
>>> Its quite possible that when we are only capturing MAX_ORDER - 1 and run
>>> a specific workload we don't get any memory back until we re-run the
>>> program and buddy finally starts merging of pages of order MAX_ORDER -1.
>>> This is why I think we may make this configurable from compile time and
>>> keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.
>> Eventually pages will never get merged. Assume you have 1 page of a
>> MAX_ORDER - 1 chunk still allocated somewhere (e.g. !movable via
>> kmalloc). You skip reporting that chunk completely. Roughly 1mb/2mb/4mb
>> wasted (depending on the arch). This stuff can sum up.
>
> After the discussion, here are the changes on which I am planning to
> work next:
> 1. Get rid of the kthread and dynamically allocate a per-cpu array to
> hold the
> isolated pages. As soon as the initial per-cpu array is completely
> scanned, release it
> so that we don't end up blocking anything.
> 2. Continue capturing MAX_ORDER - 1, for now. Reduce the initial per-cpu
> array size to 256
> for now. As we are doing asynchronous reporting we should be fine with a
> lower size array.
> 3. As soon as the host responds, release the pages back to the buddy
> from the callback and free the request.
>
> Benefits wrt current implementation:
> 1. We will not eat up performance due to kernel thread.
> 2. We will still be doing reporting asynchronously=> no blocking.
> 3. Hopefully, we will be able to free more memory.
>

+1 to that approach. We can fine tune the numbers (array size, sizes to
report) easily later on. Let us know if you run into problems doing the
allocation for the request. If that is a blocker, all we left with is a
synchronous approach I guess. Let's cross fingers. :)

--

Thanks,

David / dhildenb