Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO

From: Andy Lutomirski
Date: Wed Dec 10 2014 - 17:13:50 EST

Next message: Benjamin Tissoires: "[PATCH 1/2] HID: logitech-hidpp: do not return the name length"
Previous message: Tejun Heo: "Re: [PATCHSET -mm] bitmap, cpumask, nodemask: implement pr_cont functions and use di""
In reply to: Shaohua Li: "Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO"
Next in thread: Shaohua Li: "Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, Dec 10, 2014 at 1:57 PM, Shaohua Li <shli@xxxxxx> wrote:
> On Wed, Dec 10, 2014 at 11:10:52AM -0800, Andy Lutomirski wrote:
>> On Sun, Dec 7, 2014 at 7:03 PM, Shaohua Li <shli@xxxxxx> wrote:
>> > This primarily speeds up clock_gettime(CLOCK_THREAD_CPUTIME_ID, ..). We
>> > use the following method to compute the thread cpu time:
>>
>> I like the idea, and I like making this type of profiling fast. I
>> don't love the implementation because it's an information leak (maybe
>> we don't care) and it's ugly.
>>
>> The info leak could be fixed completely by having a per-process array
>> instead of a global array. That's currently tricky without wasting
>> memory, but it could be created on demand if we wanted to do that,
>> once my vvar .fault patches go in (assuming they do -- I need to ping
>> the linux-mm people).
>
> those info leak really doesn't matter.

Why not?

> But we need the global array
> anyway. The context switch detection should be per-cpu data and should
> be able to access in remote cpus.

Right, but the whole array could be per process instead of global.

I'm not saying I'm sure that would be better, but I think it's worth
considering.

>
>> As for ugliness, it seems to me that there really ought to be a better
>> way to do this. What we really want is a per-thread vvar-like thing.
>> Any code that can use TLS can do this trivially, but the vdso,
>> unfortunately, has no straightforward way to use TLS.
>>
>> If we could rely on something like TSX (ha!), then this becomes straightforward:
>>
>> xbegin()
>> rdtscp
>> read params
>> xcommit()
>>
>> But we can't do that for the next decade or so, obviously.
>>
>> If we were willing to limit ourselves to systems with rdwrgsbase, then
>> we could do:
>>
>> save gs and gsbase
>> mov $__VDSO_CPU,%gs
>> mov %gs:(base our array),%whatever
>> restore gs and gsbase
>>
>> (We actually could do this if we make arch_prctl disable this feature
>> the first time a process tries to set the gs base.)
>>
>> If we were willing to limit ourselves to at most 2^20 threads per
>> process, then we could assign each thread a 20-bit index within its
>> process with some LSL trickery. Then we could have the vdso look up
>> the thread in a per-process array with one element per thread. This
>> would IMO be the winning approach if we think we'd ever extend the set
>> of per-thread vvar things. (For the 32-bit VDSO, this is trivial --
>> just use a far pointer.)
>>
>> If we has per-cpu fixmap entries, we'd use that. Unfortunately, that
>> would be a giant mess to implement, and it would be a slowdown for
>> some workloads (and a speedup for others - hmm).
>>
>> Grumble. Thoughts?
>>
>> --Andy
>>
>> >
>> > t0 = process start
>> > t1 = most recent context switch time
>> > t2 = time at which the vsyscall is invoked
>> >
>> > thread_cpu_time = sum(time slices between t0 to t1) + (t2 - t1)
>> > = current->se.sum_exec_runtime + now - sched_clock()
>> >
>> > At context switch time We stash away
>> >
>> > adj_sched_time = sum_exec_runtime - sched_clock()
>> >
>> > in a per-cpu struct in the VVAR page and then compute
>> >
>> > thread_cpu_time = adj_sched_time + now
>> >
>> > All computations are done in nanosecs on systems where TSC is stable. If
>> > TSC is unstable, we fallback to a regular syscall.
>> > Benchmark data:
>> >
>> > for (i = 0; i < 100000000; i++) {
>> > clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
>> > sum += ts.tv_sec * NSECS_PER_SEC + ts.tv_nsec;
>> > }
>> >
>> > Baseline:
>> > real 1m3.428s
>> > user 0m5.190s
>> > sys 0m58.220s
>> >
>> > patched:
>> > real 0m4.912s
>> > user 0m4.910s
>> > sys 0m0.000s
>> >
>> > This should speed up profilers that need to query thread cpu time a lot
>> > to do fine-grained timestamps.
>> >
>> > No statistically significant regression was detected on x86_64 context
>> > switch code. Most archs that don't support vsyscalls will have this code
>> > disabled via jump labels.
>> >
>> > Cc: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
>> > Cc: H. Peter Anvin <hpa@xxxxxxxxx>
>> > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>> > Signed-off-by: Kumar Sundararajan <kumar@xxxxxx>
>> > Signed-off-by: Arun Sharma <asharma@xxxxxx>
>> > Signed-off-by: Chris Mason <clm@xxxxxx>
>> > Signed-off-by: Shaohua Li <shli@xxxxxx>
>> > ---
>> > arch/x86/include/asm/vdso.h | 9 +++++++-
>> > arch/x86/kernel/tsc.c | 14 ++++++++++++
>> > arch/x86/vdso/vclock_gettime.c | 49 ++++++++++++++++++++++++++++++++++++++++++
>> > arch/x86/vdso/vma.c | 4 ++++
>> > kernel/sched/core.c | 3 +++
>> > 5 files changed, 78 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
>> > index 42d6d2c..cbbab3a 100644
>> > --- a/arch/x86/include/asm/vdso.h
>> > +++ b/arch/x86/include/asm/vdso.h
>> > @@ -54,14 +54,21 @@ extern void __init init_vdso_image(const struct vdso_image *image);
>> > struct vdso_percpu_data {
>> > /* layout: | cpu ID | context switch count | */
>> > unsigned long cpu_cs_count;
>> > +
>> > + unsigned long long adj_sched_time;
>> > + unsigned long long cyc2ns_offset;
>> > + unsigned long cyc2ns;
>>
>> Having two offsets seems unnecessary.
>>
>> > } ____cacheline_aligned;
>> >
>> > struct vdso_data {
>> > - int dummy;
>> > + unsigned int thread_cputime_disabled;
>> > struct vdso_percpu_data vpercpu[0];
>> > };
>> > extern struct vdso_data vdso_data;
>> >
>> > +struct static_key;
>> > +extern struct static_key vcpu_thread_cputime_enabled;
>> > +
>> > #ifdef CONFIG_SMP
>> > #define VDSO_CS_COUNT_BITS \
>> > (sizeof(unsigned long) * 8 - NR_CPUS_BITS)
>> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
>> > index b7e50bb..83f8091 100644
>> > --- a/arch/x86/kernel/tsc.c
>> > +++ b/arch/x86/kernel/tsc.c
>> > @@ -21,6 +21,7 @@
>> > #include <asm/hypervisor.h>
>> > #include <asm/nmi.h>
>> > #include <asm/x86_init.h>
>> > +#include <asm/vdso.h>
>> >
>> > unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
>> > EXPORT_SYMBOL(cpu_khz);
>> > @@ -263,6 +264,11 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
>> > data->cyc2ns_offset = ns_now -
>> > mul_u64_u32_shr(tsc_now, data->cyc2ns_mul, CYC2NS_SCALE_FACTOR);
>> >
>> > +#ifdef CONFIG_VDSO_CS_DETECT
>> > + vdso_data.vpercpu[cpu].cyc2ns = data->cyc2ns_mul;
>> > + vdso_data.vpercpu[cpu].cyc2ns_offset = data->cyc2ns_offset;
>> > +#endif
>> > +
>> > cyc2ns_write_end(cpu, data);
>> >
>> > done:
>> > @@ -989,6 +995,10 @@ void mark_tsc_unstable(char *reason)
>> > tsc_unstable = 1;
>> > clear_sched_clock_stable();
>> > disable_sched_clock_irqtime();
>> > +#ifdef CONFIG_VDSO_CS_DETECT
>> > + vdso_data.thread_cputime_disabled = 1;
>> > + static_key_slow_dec(&vcpu_thread_cputime_enabled);
>> > +#endif
>> > pr_info("Marking TSC unstable due to %s\n", reason);
>> > /* Change only the rating, when not registered */
>> > if (clocksource_tsc.mult)
>> > @@ -1202,6 +1212,10 @@ void __init tsc_init(void)
>> >
>> > tsc_disabled = 0;
>> > static_key_slow_inc(&__use_tsc);
>> > +#ifdef CONFIG_VDSO_CS_DETECT
>> > + vdso_data.thread_cputime_disabled = !cpu_has(&boot_cpu_data,
>> > + X86_FEATURE_RDTSCP);
>> > +#endif
>>
>> This is backwards IMO. Even if this function never runs, the flag
>> should be set to disable the feature. Then you can enable it here.
>>
>> >
>> > if (!no_sched_irq_time)
>> > enable_sched_clock_irqtime();
>> > diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> > index 438b3be..46c03af2 100644
>> > --- a/arch/x86/vdso/vclock_gettime.c
>> > +++ b/arch/x86/vdso/vclock_gettime.c
>> > @@ -299,6 +299,48 @@ notrace unsigned long __vdso_get_context_switch_count(void)
>> > smp_rmb();
>> > return VVAR(vdso_data).vpercpu[cpu].cpu_cs_count;
>> > }
>> > +
>> > +#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
>> > +notrace static inline unsigned long long __cycles_2_ns(unsigned long long cyc,
>> > + unsigned long long scale,
>> > + unsigned long long offset)
>> > +{
>> > + unsigned long long ns = offset;
>> > + ns += mul_u64_u32_shr(cyc, scale, CYC2NS_SCALE_FACTOR);
>> > + return ns;
>> > +}
>> > +
>> > +notrace static unsigned long do_thread_cpu_time(void)
>> > +{
>> > + unsigned int p;
>> > + u_int64_t tscval;
>> > + unsigned long long adj_sched_time;
>> > + unsigned long long scale;
>> > + unsigned long long offset;
>> > + const struct vdso_data *vp = &VVAR(vdso_data);
>> > + int cpu;
>> > + unsigned long cs;
>> > +
>> > + do {
>> > + cs = __vdso_get_context_switch_count();
>> > +
>> > + rdtscpll(tscval, p);
>> > + cpu = p & VGETCPU_CPU_MASK;
>> > + adj_sched_time = vp->vpercpu[cpu].adj_sched_time;
>> > + scale = vp->vpercpu[cpu].cyc2ns;
>> > + offset = vp->vpercpu[cpu].cyc2ns_offset;
>> > +
>> > + } while (unlikely(cs != __vdso_get_context_switch_count()));
>> > +
>> > + return __cycles_2_ns(tscval, scale, offset) + adj_sched_time;
>>
>> You're literally adding two offsets together here, although it's
>> obscured by the abstraction of __cycles_2_ns.
>
> I don't understand 'two offsets' stuff, could you please give more
> details?

I think you're only really using cyc2ns_offset + adj_sched_time, so it
seems unnecessary to store both.

--Andy

>
>> Also, if you actually expand this function out, it's not optimized
>> well. You're doing, roughly:
>>
>> getcpu
>> read cs count
>> rdtscp
>> read clock params
>> getcpu
>> read cs count
>> repeat if necessary
>>
>> You should need at most two operations that read the cpu number. You could do:
>>
>> getcpu
>> read cs count
>> rdtscp
>> read clock params
>> barrier()
>> check that both cpu numbers agree and that the cs count hasn't changed
>>
>> All those smp barriers should be unnecessary, too. There's no SMP here.
>>
>> Also, if you do this, then the high bits of the cs count can be removed.
>> It may also make sense to bail and use the fallback after a couple
>> failures. Otherwise, in some debuggers, you will literally never
>> finish the loop.
>
> ok, makes sense.
>
>> I am curious, though: why are you using sched clock instead of
>> CLOCK_MONOTONIC? Is it because you can't read CLOCK_MONOTONIC in the
>> scheduler?
>
> The sched clock is faster, I guess.
>
> Thanks,
> Shaohua

--
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Benjamin Tissoires: "[PATCH 1/2] HID: logitech-hidpp: do not return the name length"
Previous message: Tejun Heo: "Re: [PATCHSET -mm] bitmap, cpumask, nodemask: implement pr_cont functions and use di""
In reply to: Shaohua Li: "Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO"
Next in thread: Shaohua Li: "Re: [PATCH 3/3] X86: Add a thread cpu time implementation to vDSO"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]