Re: tip.today - scheduler bam boom crash (cpu hotplug)

From: Peter Zijlstra
Date: Mon Feb 27 2017 - 11:44:07 EST


On Mon, Feb 27, 2017 at 04:27:32PM +0100, Paolo Bonzini wrote:
>
>
> On 27/02/2017 14:04, Peter Zijlstra wrote:
> >>>> This results in sched clock always unstable for kvm guest since there
> >>>> is no invariant tsc cpuid bit exposed for kvm guest currently.
> >>> What the heck is KVM_FEATURE_CLOCKSOURCE_STABLE_BIT /
> >>> PVCLOCK_TSC_STABLE_BIT about then?
> >> It checks that all the bugs in the host have been ironed out, and that
> >> the host itself supports invtsc.
> > But what does it mean if that is not so? That is, will kvm_clock_read()
> > still be stable even if !stable?
>
> If kvmclock is !stable, nobody should have set that the sched clock to
> stable, to begin with.

OK, so if !KVM_FEATURE_CLOCKSOURCE_STABLE_BIT nothing is stable, but if
it is set, TSC might still not be stable, but kvm_clock_read() is.

> However, if kvmclock is stable, we know that the sched clock is stable.

Right, so the problem is that we only ever want to allow marking
unstable -- once its found unstable, for whatever reason, we should
never allow going stable. The corollary of this proposition is that we
must start out assuming it will become stable. And to avoid actually
using unstable TSC we do a 3 state bringup:

1) sched_clock_running = 0, __stable_early = 1, __stable = 0
2) sched_clock_running = 1 (__stable is effective, iow, we run unstable)
3) sched_clock_running = 2 (__stable <- __stable_early)

2) happens 'early' but is 'safe'.
3) happens 'late', after we've brought up SMP and probed TSC

Between there, we should have detected the most common TSC wreckage and
made sure to not then switch to 'stable' at 3.

Now the problem appears to be that we assume sched_clock will use RDTSC
(native_sched_clock) while sched_clock is a paravirt op.

Now, I've not yet figured out the ordering between when we set
pv_time_ops.sched_clock and when we do the 'normal' TSC init stuff.

But it appears to me, we should not be calling
clear_sched_clock_stable() on TSC bits when we don't end up using
native_sched_clock().