Re: [PATCH v6 1/1] xen/time: do not decrease steal time after live migration on xen

From: Dongli Zhang
Date: Wed Nov 01 2017 - 21:20:36 EST


Hi Boris,

I have received from lkp@xxxxxxxxx that the prior version of patch hit issue
during compilation with aarch64-linux-gnu-gcc. I think this patch reviewed by
you would hit the same compiling issue on arm64 (there is no issue with x86_64).

-------------------------------------------------------------

1st issue:

Without including header <linux/slab.h> into driver/xen/time.c, compilation on
x86_64 works well (without any warning or error) but arm64 would hit the
following error:

drivers/xen/time.c: In function ‘xen_manage_runstate_time’:
drivers/xen/time.c:94:20: error: implicit declaration of function
‘kmalloc_array’ [-Werror=implicit-function-declaration]
runstate_delta = kmalloc_array(num_possible_cpus(),
^

drivers/xen/time.c:131:3: error: implicit declaration of function ‘kfree’
[-Werror=implicit-function-declaration]
kfree(runstate_delta);
^
cc1: some warnings being treated as errors

About the 1st issue, should I submit a new patch including <linux/slab.h> or
just a incremental based on previous patch merged into your own branch
/tree?

-------------------------------------------------------------

2nd issue:

aarch64-linux-gnu-gcc expects a cast for kmalloc_array(). Is this really
necessary as I did find people casting the return type of
kmalloc/kcalloc/kmalloc_array in linux source code (e.g.,
drivers/block/virtio_blk.c). Can we just ignore this warning?

drivers/xen/time.c:94:18: warning: assignment makes pointer from integer without
a cast [-Wint-conversion]
runstate_delta = kmalloc_array(num_possible_cpus(),
^
-------------------------------------------------------------


Thank you very much!

Dongli Zhang


On 11/02/2017 03:19 AM, Boris Ostrovsky wrote:
> On 10/31/2017 09:46 PM, Dongli Zhang wrote:
>> After guest live migration on xen, steal time in /proc/stat
>> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
>> xen_steal_lock() might be less than this_rq()->prev_steal_time which is
>> derived from previous return value of xen_steal_clock().
>>
>> For instance, steal time of each vcpu is 335 before live migration.
>>
>> cpu 198 0 368 200064 1962 0 0 1340 0 0
>> cpu0 38 0 81 50063 492 0 0 335 0 0
>> cpu1 65 0 97 49763 634 0 0 335 0 0
>> cpu2 38 0 81 50098 462 0 0 335 0 0
>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>
>> After live migration, steal time is reduced to 312.
>>
>> cpu 200 0 370 200330 1971 0 0 1248 0 0
>> cpu0 38 0 82 50123 500 0 0 312 0 0
>> cpu1 65 0 97 49832 634 0 0 312 0 0
>> cpu2 39 0 82 50167 462 0 0 312 0 0
>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>
>> Since runstate times are cumulative and cleared during xen live migration
>> by xen hypervisor, the idea of this patch is to accumulate runstate times
>> to global percpu variables before live migration suspend. Once guest VM is
>> resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
>> runstate times and previously accumulated times stored in global percpu
>> variables. Comments before the call of HYPERVISOR_suspend() has been
>> removed as it is inaccurate. The call can return an error code (e.g.,
>> possibly -EPERM in the future).
>
> I'd like split comment removal bit into a separate paragraph. I can do
> this when committing if you don't mind.
>
>>
>> Similar and more severe issue would impact prior linux 4.8-4.10 as
>> discussed by Michael Las at
>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
>> which would overflow steal time and lead to 100% st usage in top command
>> for linux 4.8-4.10. A backport of this patch would fix that issue.
>>
>> References: https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>> Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
>>
>> ---
>
> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
>
>