Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

From: Shaohua Li
Date: Wed Dec 10 2014 - 16:57:34 EST


On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <shli@xxxxxx> wrote:
> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
> > use the following method to compute the thread cpu time:
>
> I like the idea, and I like making this type of profiling fast. I
> don't love the implementation because it's an information leak (maybe
> we don't care) and it's ugly.
>
> The info leak could be fixed completely by having a per-process array
> instead of a global array. That's currently tricky without wasting
> memory, but it could be created on demand if we wanted to do that,
> once my vvar .fault patches go in (assuming they do -- I need to ping
> the linux-mm people).

those info leak really doesn't matter. But we need the global array
anyway. The context switch detection should be per-cpu data and should
be able to access in remote cpus.

> As for ugliness, it seems to me that there really ought to be a better
> way to do this. What we really want is a per-thread vvar-like thing.
> Any code that can use TLS can do this trivially, but the vdso,
> unfortunately, has no straightforward way to use TLS.
>
> If we could rely on something like TSX (ha!), then this becomes straightforward:
>
> xbegin()
> rdtscp
> read params
> xcommit()
>
> But we can't do that for the next decade or so, obviously.
>
> If we were willing to limit ourselves to systems with rdwrgsbase, then
> we could do:
>
> save gs and gsbase
> mov $__VDSO_CPU,%gs
> mov %gs:(base our array),%whatever
> restore gs and gsbase
>
> (We actually could do this if we make arch_prctl disable this feature
> the first time a process tries to set the gs base.)
>
> If we were willing to limit ourselves to at most 2^20 threads per
> process, then we could assign each thread a 20-bit index within its
> process with some LSL trickery. Then we could have the vdso look up
> the thread in a per-process array with one element per thread. This
> would IMO be the winning approach if we think we'd ever extend the set
> of per-thread vvar things. (For the 32-bit VDSO, this is trivial --
> just use a far pointer.)
>
> If we has per-cpu fixmap entries, we'd use that. Unfortunately, that
> would be a giant mess to implement, and it would be a slowdown for
> some workloads (and a speedup for others - hmm).
>
> Grumble. Thoughts?
>
> --Andy
>
> >
> > t0 = process start
> > t1 = most recent context switch time
> > t2 = time at which the vsyscall is invoked
> >
> > thread_cpu_time = sum(time slices between t0 to t1) + (t2 - t1)
> > = current->se.sum_exec_runtime + now - sched_clock()
> >
> > At context switch time We stash away
> >
> > adj_sched_time = sum_exec_runtime - sched_clock()
> >
> > in a per-cpu struct in the VVAR page and then compute
> >
> > thread_cpu_time = adj_sched_time + now
> >
> > All computations are done in nanosecs on systems where TSC is stable. If
> > TSC is unstable, we fallback to a regular syscall.
> > Benchmark data:
> >
> > for (i = 0; i < 100000000; i++) {
> > clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
> > sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
> > }
> >
> > Baseline:
> > real 1m3.428s
> > user 0m5.190s
> > sys 0m58.220s
> >
> > patched:
> > real 0m4.912s
> > user 0m4.910s
> > sys 0m0.000s
> >
> > This should speed up profilers that need to query thread cpu time a lot
> > to do fine-grained timestamps.
> >
> > No statistically significant regression was detected on x86_64 context
> > switch code. Most archs that don't support vsyscalls will have this code
> > disabled via jump labels.
> >
> > Cc: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
> > Cc: H. Peter Anvin <hpa@xxxxxxxxx>
> > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > Signed-off-by: Kumar Sundararajan <kumar@xxxxxx>
> > Signed-off-by: Arun Sharma <asharma@xxxxxx>
> > Signed-off-by: Chris Mason <clm@xxxxxx>
> > Signed-off-by: Shaohua Li <shli@xxxxxx>
> > ---
> > arch/x86/include/asm/vdso.h | 9 +++++++-
> > arch/x86/kernel/tsc.c | 14 ++++++++++++
> > arch/x86/vdso/vclock_gettime.c | 49 ++++++++++++++++++++++++++++++++++++++++++
> > arch/x86/vdso/vma.c | 4 ++++
> > kernel/sched/core.c | 3 +++
> > 5 files changed, 78 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
> > index 42d6d2c..cbbab3a 100644
> > --- a/arch/x86/include/asm/vdso.h
> > +++ b/arch/x86/include/asm/vdso.h
> > @@ -54,14 +54,21 @@ extern void __init init_vdso_image(const struct vdso_image *image);
> > struct vdso_percpu_data {
> > /* layout: | cpu ID | context switch count | */
> > unsigned long cpu_cs_count;
> > +
> > + unsigned long long adj_sched_time;
> > + unsigned long long cyc2ns_offset;
> > + unsigned long cyc2ns;
>
> Having two offsets seems unnecessary.
>
> > } ____cacheline_aligned;
> >
> > struct vdso_data {
> > - int dummy;
> > + unsigned int thread_cputime_disabled;
> > struct vdso_percpu_data vpercpu[0];
> > };
> > extern struct vdso_data vdso_data;
> >
> > +struct static_key;
> > +extern struct static_key vcpu_thread_cputime_enabled;
> > +
> > #ifdef CONFIG_SMP
> > #define VDSO_CS_COUNT_BITS \
> > (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index b7e50bb..83f8091 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -21,6 +21,7 @@
> > #include <asm/hypervisor.h>
> > #include <asm/nmi.h>
> > #include <asm/x86_init.h>
> > +#include <asm/vdso.h>
> >
> > unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
> > EXPORT_SYMBOL(cpu_khz);
> > @@ -263,6 +264,11 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> > data->cyc2ns_offset = ns_now -
> > mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
> >
> > +#ifdef CONFIG_VDSO_CS_DETECT
> > + vdso_data.vpercpu[cpu].cyc2ns = data->cyc2ns_mul;
> > + vdso_data.vpercpu[cpu].cyc2ns_offset = data->cyc2ns_offset;
> > +#endif
> > +
> > cyc2ns_write_end(cpu, data);
> >
> > done:
> > @@ -989,6 +995,10 @@ void mark_tsc_unstable(char *reason)
> > tsc_unstable = 1;
> > clear_sched_clock_stable();
> > disable_sched_clock_irqtime();
> > +#ifdef CONFIG_VDSO_CS_DETECT
> > + vdso_data.thread_cputime_disabled = 1;
> > + static_key_slow_dec(&vcpu_thread_cputime_enabled);
> > +#endif
> > pr_info("Marking TSC unstable due to %s\n", reason);
> > /* Change only the rating, when not registered */
> > if (clocksource_tsc.mult)
> > @@ -1202,6 +1212,10 @@ void __init tsc_init(void)
> >
> > tsc_disabled = 0;
> > static_key_slow_inc(&__use_tsc);
> > +#ifdef CONFIG_VDSO_CS_DETECT
> > + vdso_data.thread_cputime_disabled = !cpu_has(&boot_cpu_data,
> > + X86_FEATURE_RDTSCP);
> > +#endif
>
> This is backwards IMO. Even if this function never runs, the flag
> should be set to disable the feature. Then you can enable it here.
>
> >
> > if (!no_sched_irq_time)
> > enable_sched_clock_irqtime();
> > diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> > index 438b3be..46c03af2 100644
> > --- a/arch/x86/vdso/vclock_gettime.c
> > +++ b/arch/x86/vdso/vclock_gettime.c
> > @@ -299,6 +299,48 @@ notrace unsigned long __vdso_get_context_switch_count(void)
> > smp_rmb();
> > return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
> > }
> > +
> > +#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
> > +notrace static inline unsigned long long __cycles_2_ns(unsigned long long cyc,
> > + unsigned long long scale,
> > + unsigned long long offset)
> > +{
> > + unsigned long long ns = offset;
> > + ns += mul_u64_u32_shr(cyc, scale, CYC2NS_SCALE_FACTOR);
> > + return ns;
> > +}
> > +
> > +notrace static unsigned long do_thread_cpu_time(void)
> > +{
> > + unsigned int p;
> > + u_int64_t tscval;
> > + unsigned long long adj_sched_time;
> > + unsigned long long scale;
> > + unsigned long long offset;
> > + const struct vdso_data *vp = &VVAR(vdso_data);
> > + int cpu;
> > + unsigned long cs;
> > +
> > + do {
> > + cs = __vdso_get_context_switch_count();
> > +
> > + rdtscpll(tscval, p);
> > + cpu = p & VGETCPU_CPU_MASK;
> > + adj_sched_time = vp->vpercpu[cpu].adj_sched_time;
> > + scale = vp->vpercpu[cpu].cyc2ns;
> > + offset = vp->vpercpu[cpu].cyc2ns_offset;
> > +
> > + } while (unlikely(cs != __vdso_get_context_switch_count()));
> > +
> > + return __cycles_2_ns(tscval, scale, offset) + adj_sched_time;
>
> You're literally adding two offsets together here, although it's
> obscured by the abstraction of __cycles_2_ns.

I don't understand 'two offsets' stuff, could you please give more
details?

> Also, if you actually expand this function out, it's not optimized
> well. You're doing, roughly:
>
> getcpu
> read cs count
> rdtscp
> read clock params
> getcpu
> read cs count
> repeat if necessary
>
> You should need at most two operations that read the cpu number. You could do:
>
> getcpu
> read cs count
> rdtscp
> read clock params
> barrier()
> check that both cpu numbers agree and that the cs count hasn't changed
>
> All those smp barriers should be unnecessary, too. There's no SMP here.
>
> Also, if you do this, then the high bits of the cs count can be removed.
> It may also make sense to bail and use the fallback after a couple
> failures. Otherwise, in some debuggers, you will literally never
> finish the loop.

ok, makes sense.

> I am curious, though: why are you using sched clock instead of
> CLOCK_MONOTONIC? Is it because you can't read CLOCK_MONOTONIC in the
> scheduler?

The sched clock is faster, I guess.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/