Re: [GIT PULL] First batch of KVM changes for 4.1

From: Andy Lutomirski
Date: Fri Apr 17 2015 - 16:40:24 EST


On Fri, Apr 17, 2015 at 1:18 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> On Fri, Apr 17, 2015 at 09:57:12PM +0200, Paolo Bonzini wrote:
>>
>>
>> >> From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001
>> >> From: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>> >> Date: Fri, 17 Apr 2015 14:57:34 +0200
>> >> Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER
>> >>
>> >> The task migration notifier is only used in x86 paravirt. Make it
>> >> possible to compile it out.
>> >>
>> >> While at it, move some code around to ensure tmn is filled from CPU
>> >> registers.
>> >>
>> >> Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>> >> ---
>> >> arch/x86/Kconfig | 1 +
>> >> init/Kconfig | 3 +++
>> >> kernel/sched/core.c | 9 ++++++++-
>> >> 3 files changed, 12 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> >> index d43e7e1c784b..9af252c8698d 100644
>> >> --- a/arch/x86/Kconfig
>> >> +++ b/arch/x86/Kconfig
>> >> @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST
>> >>
>> >> config PARAVIRT
>> >> bool "Enable paravirtualization code"
>> >> + select TASK_MIGRATION_NOTIFIER
>> >> ---help---
>> >> This changes the kernel so it can modify itself when it is run
>> >> under a hypervisor, potentially improving performance significantly
>> >> diff --git a/init/Kconfig b/init/Kconfig
>> >> index 3b9df1aa35db..891917123338 100644
>> >> --- a/init/Kconfig
>> >> +++ b/init/Kconfig
>> >> @@ -2016,6 +2016,9 @@ source "block/Kconfig"
>> >> config PREEMPT_NOTIFIERS
>> >> bool
>> >>
>> >> +config TASK_MIGRATION_NOTIFIER
>> >> + bool
>> >> +
>> >> config PADATA
>> >> depends on SMP
>> >> bool
>> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> >> index f9123a82cbb6..c07a53aa543c 100644
>> >> --- a/kernel/sched/core.c
>> >> +++ b/kernel/sched/core.c
>> >> @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
>> >> rq_clock_skip_update(rq, true);
>> >> }
>> >>
>> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
>> >> static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
>> >>
>> >> void register_task_migration_notifier(struct notifier_block *n)
>> >> {
>> >> atomic_notifier_chain_register(&task_migration_notifier, n);
>> >> }
>> >> +#endif
>> >>
>> >> #ifdef CONFIG_SMP
>> >> void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>> >> @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>> >> trace_sched_migrate_task(p, new_cpu);
>> >>
>> >> if (task_cpu(p) != new_cpu) {
>> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
>> >> struct task_migration_notifier tmn;
>> >> + int from_cpu = task_cpu(p);
>> >> +#endif
>> >>
>> >> if (p->sched_class->migrate_task_rq)
>> >> p->sched_class->migrate_task_rq(p, new_cpu);
>> >> p->se.nr_migrations++;
>> >> perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
>> >>
>> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
>> >> tmn.task = p;
>> >> - tmn.from_cpu = task_cpu(p);
>> >> + tmn.from_cpu = from_cpu;
>> >> tmn.to_cpu = new_cpu;
>> >>
>> >> atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
>> >> +#endif
>> >> }
>> >>
>> >> __set_task_cpu(p, new_cpu);
>> >> --
>> >> 2.3.5
>> >
>> > Paolo,
>> >
>> > Please revert the patch -- we can fix this properly in the host,
>> > which also conforms to the documented KVM guest/host protocol.
>> >
>> > Radim submitted a patch to kvm@ to split
>> > the kvm_write_guest in two with a barrier in between, I think.
>> >
>> > I'll review that patch.
>>
>> You're thinking of
>> http://article.gmane.org/gmane.linux.kernel.stable/129187, but see
>> Andy's reply:
>>
>> >
>> > I think there are at least two ways that would work:
>> >
>> > a) If KVM incremented version as advertised:
>> >
>> > cpu = getcpu();
>> > pvti = pvti for cpu;
>> >
>> > ver1 = pvti->version;
>> > check stable bit;
>> > rdtsc_barrier, rdtsc, read scale, shift, etc.
>> > if (getcpu() != cpu) retry;
>> > if (pvti->version != ver1) retry;
>> >
>> > I think this is safe because we're guaranteed that there was an
>> > interval (between the two version reads) in which the vcpu we think
>> > we're on was running and the kvmclock data was valid and marked
>> > stable, and we know that the tsc we read came from that interval.
>> >
>> > Note: rdtscp isn't needed. If we're stable, it makes no difference
>> > which cpu's tsc we actually read.
>> >
>> > b) If version remains buggy but we use this migrations_from hack:
>> >
>> > cpu = getcpu();
>> > pvti = pvti for cpu;
>> > m1 = pvti->migrations_from;
>> > barrier();
>> >
>> > ver1 = pvti->version;
>> > check stable bit;
>> > rdtsc_barrier, rdtsc, read scale, shift, etc.
>> > if (getcpu() != cpu) retry;
>> > if (pvti->version != ver1) retry; /* probably not really needed */
>> >
>> > barrier();
>> > if (pvti->migrations_from != m1) retry;
>> >
>> > This is just like (a), except that we're using a guest kernel hack to
>> > ensure that no one migrated off the vcpu during the version-protected
>> > critical section and that we were, in fact, on that vcpu at some point
>> > during that critical section. Once we've ensured that we were on
>> > pvti's associated vcpu for the entire time we were reading it, then we
>> > are protected by the existing versioning in the host.
>>
>> (a) is not going to happen until 4.2, and there are too many buggy hosts
>> around so we'd have to define new ABI that lets the guest distinguish a
>> buggy host from a fixed one.
>>
>> (b) works now, is not invasive, and I still maintain that the cost is
>> negligible. I'm going to run for a while with CONFIG_SCHEDSTATS to see
>> how often you have a migration.
>>
>> Anyhow if the task migration notifier is reverted we have to disable the
>> whole vsyscall support altogether.
>
> The bug this is fixing is very rare; I have no memory of a report.
>
> In fact, it's even difficult to create a synthetic reproducer. You need:
>
> 1) update of kvmclock data structure (happens once every 5 minutes).
> 2) migration of task from vcpu1 to vcpu2 back to vcpu1.
> 3) a data race between kvm_write_guest (string copy) and
> 2 above.
>
> At the same time.

Something maybe worth considering:

On my box, vclock_gettime using kvm-clock is about 40 ns. An empty
syscall is about 33 ns. clock_gettime *should* be around 17 ns.

The clock_gettime syscall is about 73 ns.
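
For anyone who wants to compare these three paths on their own box, here
is a minimal userspace sketch. It is not the harness behind the numbers
above. The function names are made up for illustration, getppid is only
my assumption for what counts as an "empty" syscall, and the libc
clock_gettime wrapper is assumed to go through the vdso, which depends
on the libc and the clocksource in use.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Nanoseconds elapsed from *a to *b. */
static double elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec) * 1e9 +
           (double)(b->tv_nsec - a->tv_nsec);
}

/* Average cost of the libc wrapper, which normally takes the vdso path. */
double bench_vdso_ns(long iters)
{
    struct timespec start, end, ts;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ns(&start, &end) / (double)iters;
}

/* Average cost of forcing the real clock_gettime syscall. */
double bench_syscall_ns(long iters)
{
    struct timespec start, end, ts;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ns(&start, &end) / (double)iters;
}

/* Average cost of a near-empty syscall (getppid is an assumed stand-in). */
double bench_empty_syscall_ns(long iters)
{
    struct timespec start, end;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getppid);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ns(&start, &end) / (double)iters;
}
```

Calling each function with a few million iterations and printing the
results gives a rough ns-per-call figure. The loop overhead is included,
so treat the output as a comparison between paths rather than as
absolute costs.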

Could we figure out why clock_gettime (the syscall) is so slow, fix
it, and then not be so sad about removing the existing kvm-clock vdso
code? Once we fix the host for real (add a new feature bit and make
it global instead of per-cpu), then we could have a really fast vdso
implementation, too.
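
To make scheme (a) quoted above a bit more concrete, here is a rough
userspace model of the read loop. It is only a sketch: struct fake_pvti,
FAKE_TSC_STABLE_BIT, the function-pointer hooks and the demo_* stubs are
all hypothetical stand-ins so it can run outside the kernel, and it
assumes the host bumps version as documented (odd while an update is in
progress). It is not the kernel's or the vdso's actual code.

```c
#include <stdint.h>
#include <stdatomic.h>

#define FAKE_TSC_STABLE_BIT 0x1u   /* hypothetical name for the stable flag */

/* Simplified, hypothetical model of one vcpu's time info. */
struct fake_pvti {
    uint32_t version;            /* assumed odd while the host is mid-update */
    uint8_t  flags;
    int8_t   tsc_shift;
    uint32_t tsc_to_system_mul;
    uint64_t tsc_timestamp;
    uint64_t system_time;
};

typedef unsigned (*getcpu_fn)(void);
typedef uint64_t (*rdtsc_fn)(void);

/* Stand-in hooks so the sketch runs in userspace. */
unsigned demo_getcpu(void) { return 0; }
uint64_t demo_rdtsc(void) { return 3000; }

/* Shift the delta, then multiply by a 32.32 fixed-point factor. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    return ((delta >> 32) * mul) + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Returns 0 and stores nanoseconds on success, -1 if the clock is not
 * marked stable (the caller would fall back to the real syscall). */
int read_clock_scheme_a(const volatile struct fake_pvti *table,
                        getcpu_fn getcpu, rdtsc_fn rdtsc, uint64_t *ns_out)
{
    for (;;) {
        unsigned cpu = getcpu();
        const volatile struct fake_pvti *pvti = &table[cpu];
        uint32_t ver1 = pvti->version;

        atomic_thread_fence(memory_order_acquire);
        if (ver1 & 1)
            continue;                       /* update in progress, retry */
        if (!(pvti->flags & FAKE_TSC_STABLE_BIT))
            return -1;                      /* not stable, give up */

        uint64_t tsc = rdtsc();
        uint64_t ns = pvti->system_time +
                      scale_delta(tsc - pvti->tsc_timestamp,
                                  pvti->tsc_to_system_mul, pvti->tsc_shift);

        atomic_thread_fence(memory_order_acquire);
        if (getcpu() != cpu)
            continue;                       /* migrated, retry */
        if (pvti->version != ver1)
            continue;                       /* data changed under us, retry */

        *ns_out = ns;
        return 0;
    }
}
```

The two retry checks at the end are the point: if both pass, there was
an interval in which we were on that vcpu and its data was valid, and
the tsc read falls inside that interval.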

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC