Re: [PATCH 1/9] KVM: arm64: Document PV-time interface

From: Steven Price
Date: Wed Aug 07 2019 - 11:27:00 EST

On 07/08/2019 15:28, Christophe de Dinechin wrote:
>> On 7 Aug 2019, at 15:21, Steven Price <steven.price@xxxxxxx> wrote:
>> On 05/08/2019 17:40, Christophe de Dinechin wrote:
>>> Steven Price writes:
>>>> Introduce a paravirtualization interface for KVM/arm64 based on the
>>>> "Arm Paravirtualized Time for Arm-Base Systems" specification DEN 0057A.
>>>> This only adds the details about "Stolen Time" as the details of "Live
>>>> Physical Time" have not been fully agreed.
>>> [...]
>>>> +
>>>> +Stolen Time
>>>> +-----------
>>>> +
>>>> +The structure pointed to by the PV_TIME_ST hypercall is as follows:
>>>> +
>>>> +  Field       | Byte Length | Byte Offset | Description
>>>> +  ----------- | ----------- | ----------- | --------------------------
>>>> +  Revision    |      4      |      0      | Must be 0 for version 0.1
>>>> +  Attributes  |      4      |      4      | Must be 0
>>>> +  Stolen time |      8      |      8      | Stolen time in unsigned
>>>> +              |             |             | nanoseconds indicating how
>>> I know very little about the topic, but I don't understand how the spec
>>> as proposed allows an accurate reading of the relation between physical
>>> time and stolen time simultaneously. In other words, could you draw
>>> Figure 1 of the spec from within the guest? Or is it a non-objective?
>> Figure 1 is mostly attempting to explain Live Physical Time (LPT), which
>> is not part of this patch series. But it does touch on stolen time by
>> the difference between "live physical time" and "virtual time".
>> I'm not sure what you mean by "from within the guest". From the
>> perspective of the guest the parts of the diagram where the guest isn't
>> running don't exist (therefore there are discontinuities in the
>> "physical time" and "live physical time" lines).
> I meant: If I run code within the guest that attempts to draw Figure 1,
> race conditions may cause the diagram actually drawn by your guest
> program to look completely wrong on occasions.
>> This patch series doesn't attempt to provide the guest with a view of
>> "physical time" (or LPT) - but it might be able to observe that by
>> consulting something external (e.g. an NTP server, or an emulated RTC
>> which reports wall-clock time).
> ... with what appears to be a built-in race condition, as you correctly
> identified. I was wondering if the built-in race condition was deliberate
> and/or necessary, or if it was irrelevant for the planned uses of the value.
>> What it does provide is a mechanism for obtaining the difference (as
>> reported by the host) between "live physical time" and "virtual time" -
>> this is reported in nanoseconds in the above structure.
>>> For example, if you read the stolen time before you read CNTVCT_EL0,
>>> isn't it possible for a lengthy event like a migration to occur between
>>> the two reads, causing the stolen time to be obsolete and off by seconds?
>> "Lengthy events" like migration are represented by the "paused" state in
>> the diagram - i.e. it's the difference between "physical time" and "live
>> physical time". So stolen time doesn't attempt to represent that.
>> And yes, there is a race between reading CNTVCT_EL0 and reading stolen
>> time - but in practice this doesn't really matter. The usual pseudo-code
>> way of using stolen time is:
> I'm assuming this is the guest scheduler you are talking about,
> and I'm assuming virtualization can preempt that code anywhere.
> Maybe that's where I'm wrong?

You are correct, the guest can be preempted at any point.

> For the sake of the argument, assume there is a 1s pause.
> Not completely unreasonable in a migration scenario.

As I mentioned before, events like migration are not represented by
stolen time. They would be represented by CNTVCT_EL0 appearing to pause
during the migration (so showing a difference between "physical time"
and "live physical time"). The stolen time value would not be incremented.

>>  * scheduler captures stolen time from structure and CNTVCT_EL0:
>>      before_timer = CNTVCT_EL0
> [insert optional 1s pause here, case A]
>>      before_stolen = stolen
>>  * schedule in process
>>  * process is pre-empted (or blocked in some way)
>>  * scheduler captures stolen time from structure and CNTVCT_EL0:
>>      after_timer = CNTVCT_EL0
> [insert optional 1s pause here, case B]
>>      after_stolen = stolen
>>      time = to_nsecs(after_timer - before_timer) -
>>             (after_stolen - before_stolen)
> In case A, time is too big by one second. In case B, it is too small,
> to the point where your code might need to be ready for
> "time" unexpectedly showing up as negative.

A 1 second pause is unlikely to show up as stolen time: stolen time means
that the VCPU was ready to run, but the host didn't run it for some reason.
In theory, though, you are correct that this could happen. The core code
deals with it like this (update_rq_clock_task):
> if (static_key_false((&paravirt_steal_rq_enabled))) {
>     steal = paravirt_steal_clock(cpu_of(rq));
>     steal -= rq->prev_steal_time_rq;
>     if (unlikely(steal > delta))
>         steal = delta;
>     rq->prev_steal_time_rq += steal;
>     delta -= steal;
> }

So if (steal > delta) then steal is capped to delta, preventing the
final delta from going negative.
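
As a rough illustration of that capping, here is a user-space sketch. It is
not the kernel code: struct steal_state and charge_delta() are made-up names
standing in for rq->prev_steal_time_rq and the snippet above, and all values
are assumed to be in nanoseconds.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-runqueue rq->prev_steal_time_rq. */
struct steal_state {
    uint64_t prev_steal_ns;
};

/*
 * Return how much of delta_ns should be charged to the running task.
 *
 * delta_ns:     elapsed time since the last update, as seen by the timer
 * steal_now_ns: cumulative stolen time as currently reported by the host
 */
static uint64_t charge_delta(struct steal_state *st, uint64_t delta_ns,
                             uint64_t steal_now_ns)
{
    /* Stolen time accumulated since the last update. */
    uint64_t steal = steal_now_ns - st->prev_steal_ns;

    /* Cap the steal so the charged time can never go negative. */
    if (steal > delta_ns)
        steal = delta_ns;

    /* Only the amount actually subtracted is recorded as consumed. */
    st->prev_steal_ns += steal;
    return delta_ns - steal;
}
```

Note that when the cap kicks in, prev_steal_ns only advances by the capped
amount, so the excess stolen time is carried over and subtracted from later
deltas rather than being lost.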

>> The scheduler can then charge the process for "time" nanoseconds of
>> time. This ensures that a process isn't unfairly penalised if the host
>> doesn't schedule the VCPU while it is supposed to be running.
>> The race is very small in comparison to the time the process is running,
>> and in the worst case just means the process is charged slightly more
>> (or less) than it should be.
> At this point, what I don't understand is why the race would be
> "very small" or why you would only be charged "slightly" more or less?

The window between measuring the time using CNTVCT_EL0 and getting the
stolen time from the hypervisor is pretty short. The amount of time that
is (normally) stolen in one go is also small. So the race is unlikely
and the error when it occurs is (usually) small.
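
To put numbers on cases A and B from the pseudo-code, here is a small sketch
of its arithmetic. struct snapshot and charged_ns() are names invented for
this example, and to_nsecs() is written out by hand here (it is not the
kernel's conversion); it assumes the counter frequency is low enough that
(freq_hz - 1) * 10^9 fits in 64 bits.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Convert counter ticks to nanoseconds for a counter running at freq_hz. */
static uint64_t to_nsecs(uint64_t ticks, uint64_t freq_hz)
{
    /* Split into whole seconds and remainder to avoid overflow. */
    return (ticks / freq_hz) * NSEC_PER_SEC +
           (ticks % freq_hz) * NSEC_PER_SEC / freq_hz;
}

/* One (timer, stolen) pair as captured by the scheduler. */
struct snapshot {
    uint64_t timer;      /* counter value, e.g. read from CNTVCT_EL0 */
    uint64_t stolen_ns;  /* stolen time read from the shared structure */
};

/*
 * time = to_nsecs(after_timer - before_timer) -
 *        (after_stolen - before_stolen)
 *
 * Returned as a signed value so that a case B underflow stays visible
 * instead of wrapping around.
 */
static int64_t charged_ns(struct snapshot before, struct snapshot after,
                          uint64_t freq_hz)
{
    return (int64_t)to_nsecs(after.timer - before.timer, freq_hz) -
           (int64_t)(after.stolen_ns - before.stolen_ns);
}
```

With an assumed 50 MHz counter and a process that really ran for 10ms, a 1s
steal landing in window A gives 1,010,000,000ns and one landing in window B
gives -990,000,000ns. The error is exactly the amount stolen inside the
window, which is why it only matters if a large steal happens to hit that
short window.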

Long events (such as migration or pausing the guest) are not considered
"stolen time" and should be reflected to the guest in other ways.

>> I guess if you're really worried about it, you could do a dance like:
>> do {
>> before = stolen
>> timer = CNTVCT_EL0
>> after = stolen
>> } while (before != after);
> That will work as long as nothing in that loop requires something
> that would cause `stolen` to jump. If there is such a guarantee,
> then that's even efficient, because in most cases the loop
> would only run once, at the cost of one extra read and one test.

Note that other architectures don't have such loops, so arm64 is just
following the lead of existing architectures.
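
For reference, the loop quoted above could look something like this in C.
This is a sketch against a simulated host: struct fake_host and its read
helpers are invented for the example, and one counter tick is treated as one
nanosecond. In a real guest the stolen time read would need to be something
the compiler cannot merge or elide (READ_ONCE() in Linux) and the timer read
would be CNTVCT_EL0.

```c
#include <stdint.h>

/* Simulated view of what the guest can observe (illustration only). */
struct fake_host {
    uint64_t timer;         /* stands in for CNTVCT_EL0 */
    uint64_t stolen_ns;     /* stands in for the shared stolen time field */
    int      steal_on_read; /* timer read (1-based) that gets hit; 0 = never */
    uint64_t steal_ns;      /* amount stolen when that happens */
    int      timer_reads;   /* number of timer reads performed so far */
};

struct time_sample {
    uint64_t timer;
    uint64_t stolen_ns;
};

static uint64_t read_stolen(const struct fake_host *h)
{
    return h->stolen_ns;
}

static uint64_t read_timer(struct fake_host *h)
{
    uint64_t value;

    h->timer_reads++;
    h->timer += 10;             /* time moves on between reads */
    value = h->timer;

    /* Simulate the host stealing time right after the counter was read. */
    if (h->timer_reads == h->steal_on_read) {
        h->stolen_ns += h->steal_ns;
        h->timer += h->steal_ns;
    }
    return value;
}

/* Retry until stolen time did not change around the timer read. */
static struct time_sample sample_consistent(struct fake_host *h)
{
    struct time_sample s;
    uint64_t before, after;

    do {
        before = read_stolen(h);
        s.timer = read_timer(h);
        after = read_stolen(h);
    } while (before != after);

    s.stolen_ns = after;
    return s;
}
```

In the common case the loop body runs once. If time is stolen between the
timer read and the second stolen read, the mismatch is detected and the pair
is sampled again.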

>> But I don't see the need to have such an accurate view of elapsed time
>> that the VCPU was scheduled. And of course at the moment (without this
>> series) the guest has no idea about time stolen by the host.
> I'm certainly not arguing that exposing stolen time is a bad idea,
> I'm only wondering if the proposed solution is racy, and if so, if
> it is intentional.
> If it's indeed racy, the problem could be mitigated in a number of
> ways:
> a) document your loop or something similar as being the recommended
> way to avoid the race, and then ensure that the loop actually
> will always work as intended. The upside is that it's just a change in
> some comments or documentation.
> b) having a single interface that exposes multiple times. For example,
> you could have a copy of CNTVCT_EL0 written alongside stolen time,
> and then the scheduler could use that copy for its decision.

That would still be racy - the structure can be updated at any time (as
the host could interrupt the VCPU at any time), so you would still be
left with the problem of reading both atomically - which would mean
going back to the loop. This is the approach that LPT takes and is
documented in the spec.

Also, I can't see why you would want to know the CNTVCT_EL0 value at the
point the stolen time was updated; it's much more useful to know the
current CNTVCT_EL0 value.

Ultimately reading the stolen time is always going to be slightly racy
because you are including some of the scheduler's time in the
calculation of how much time the process was running for. The pauses you
describe above are instances where time has been stolen from the
scheduler, but that time is being accounted for/against a user space
process. While the algorithm could be changed so that it's always a
positive for the user space process I'm not sure that's a benefit (it's
probably better that statistically it can go either way).