Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

From: Jason Vas Dias
Date: Wed Mar 14 2018 - 08:55:30 EST


Thanks for the helpful comments, Peter -
re:
On 14/03/2018, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
>> Yes, I am sampling perf counters,
>
> You're not in fact sampling, you're just reading the counters.

Correct, in Linux terminology - "sampling" only in the looser English sense.


>> Reading performance counters does involve 2 ioctls and a read() ,
>
> So you can avoid the whole ioctl(ENABLE), ioctl(DISABLE) nonsense and
> just let them run and do:
>
> read(group_fd, &buf_pre, size);
> /* your code section */
> read(group_fd, &buf_post, size);
>
> /* compute buf_post - buf_pre */
>
> Which is only 2 system calls, not 4.

But I can't, really - I am trying to restrict the
performance-counter measurements
to only a subset of the code, and to exclude the
processing of the measurement results -
so the timeline looks like:
      struct timespec t_start, t_end;
      perf_event_open(...);
      thread_main_loop() { ... do {
t    :    clock_gettime(CLOCK_MONOTONIC_RAW, &t_start);
t+x  :    enable_perf();
          total_work = do_some_work();
          disable_perf();
          clock_gettime(CLOCK_MONOTONIC_RAW, &t_end);
t+y  :    read_perf_counters_and_store_results(
              perf_grp_fd, &results, total_work,
              TS2T(&t_end) - TS2T(&t_start)
          );
      } while (....);
      }

Now, here the bandwidth / performance results recorded by
my 'read_perf_counters_and_store_results' method
are very sensitive to the measurement of the OUTER
elapsed time.

>
> Also, a while back there was the proposal to extend the mmap()
> self-monitoring interface to groups, see:
>
> https://lkml.kernel.org/r/20170530172555.5ya3ilfw3sowokjz@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> I never did get around to writing the actual code for it, but it
> shouldn't be too hard.
>

Great, I'm looking forward to trying it - but meanwhile,
to get NON-MULTIPLEXED measurements for the SAME CODE SEQUENCE
over the SAME TIME, I believe the group FD method is what is
implemented and what works - a minimal sketch of that setup
follows below.
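
Just to make concrete what I mean by the group FD method, here is a
minimal sketch (the event choices, the buffer size and the absence of
error handling are illustrative assumptions, not my actual code):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* thin wrapper - glibc provides no perf_event_open() */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                           int group_fd, unsigned long flags)
{
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
        struct perf_event_attr attr;
        struct timespec t_start, t_end;

        /* group leader: CPU_CLOCK software counter, created disabled */
        memset(&attr, 0, sizeof(attr));
        attr.size        = sizeof(attr);
        attr.type        = PERF_TYPE_SOFTWARE;
        attr.config      = PERF_COUNT_SW_CPU_CLOCK;
        attr.disabled    = 1;
        attr.read_format = PERF_FORMAT_GROUP
                         | PERF_FORMAT_TOTAL_TIME_ENABLED
                         | PERF_FORMAT_TOTAL_TIME_RUNNING;
        int grp_fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (grp_fd < 0)
                return 1;

        /* group member: instructions retired (illustrative choice) */
        memset(&attr, 0, sizeof(attr));
        attr.size   = sizeof(attr);
        attr.type   = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        perf_event_open(&attr, 0, -1, grp_fd, 0);

        /* measured section, bracketed by CLOCK_MONOTONIC_RAW reads */
        clock_gettime(CLOCK_MONOTONIC_RAW, &t_start);
        ioctl(grp_fd, PERF_EVENT_IOC_RESET,  PERF_IOC_FLAG_GROUP);
        ioctl(grp_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
        /* ... do_some_work() ... */
        ioctl(grp_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
        clock_gettime(CLOCK_MONOTONIC_RAW, &t_end);

        /* one read() returns all counters for the same enabled interval:
         * { nr, time_enabled, time_running, value[0], value[1] }        */
        uint64_t buf[8];
        read(grp_fd, buf, sizeof(buf));
        printf("nr=%llu enabled=%lluns running=%lluns cpu_clock=%llu insns=%llu\n",
               (unsigned long long)buf[0], (unsigned long long)buf[1],
               (unsigned long long)buf[2], (unsigned long long)buf[3],
               (unsigned long long)buf[4]);
        return 0;
}

With PERF_FORMAT_GROUP the single read() returns all the counters for
exactly the same enabled interval, which is the whole point of using a
group here.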


>> The CPU_CLOCK software counter should give the converted TSC cycles
>> seen between the ioctl( grp_fd, PERF_EVENT_IOC_ENABLE , ...)
>> and the ioctl( grp_fd, PERF_EVENT_IOC_DISABLE ), and the
>> difference between the event->time_running and time_enabled
>> should also measure elapsed time .
>
> While CPU_CLOCK is TSC based, there is no guarantee it has any
> correlation to CLOCK_MONOTONIC_RAW (even if that is also TSC based).
>
> (although, I think I might have fixed that recently and it might just
> work, but it's very much not guaranteed).

Yes, I believe the CPU_CLOCK is effectively the converted TSC -
it does appear to correlate well with the new CLOCK_MONOTONIC_RAW
values from the patched VDSO.

> If you want to correlate to CLOCK_MONOTONIC_RAW you have to read
> CLOCK_MONOTONIC_RAW and not some random other clock value.
>

Exactly ! Hence the need for the patch so that users can get
CLOCK_MONOTONIC_RAW values with low latency and correlate them
with PERF CPU_CLOCK values.

>> This gives the "inner" elapsed time, from the perpective of the kernel,
>> while the measured code section had the counters enabled.
>>
>> But unless the user-space program also has a way of measuring elapsed
>> time from the CPU's perspective , ie. without being subject to
>> operator or NTP / PTP adjustment, it has no way of correlating this
>> inner elapsed time with any "outer"
>
> You could read the time using the group_fd's mmap() page. That actually
> includes the TSC mult,shift,offset as used by perf clocks.
>

Yes, but as mentioned earlier, that presupposes I want to use the mmap()
sample method - I don't - I want to use the Group FD method, so
that I can be sure the measurements are for the same code sequence
over the same period of time.

>> Currently, users must parse the log file or use gdb / objdump to
>> inspect /proc/kcore to get the TSC calibration and exact
>> mult+shift values for the TSC value conversion.
>
> Which ;-) there's multiple floating around..
>

Yes, but why must Linux make it so difficult?
I think it has to be recognized that the vDSO (or the user-space
program itself) is the only place where clock values can be produced
with latencies low enough to be useful to user-space programs.
So why not export the TSC calibration, which is so complex to derive
and which is available nowhere else?
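
If the calibrated 'mult' and 'shift' were exported, user space could do
the same conversion the kernel clocksource code does. A sketch, using
made-up values for roughly a 2.8333GHz TSC (the real values come from
the kernel's TSC calibration and are not exported today):

#include <stdint.h>
#include <stdio.h>

/* Illustrative values only: at ~2.8333GHz, ns_per_cycle ~= 0.35294,
 * and 0.35294 * 2^24 ~= 5921645.                                   */
#define TSC_MULT  5921645U
#define TSC_SHIFT 24

/* Same conversion the kernel clocksource code uses:
 *   ns = (cycles * mult) >> shift
 * (valid for deltas small enough that cycles * mult fits in 64 bits) */
static inline uint64_t tsc_delta_to_ns(uint64_t cycles)
{
        return (cycles * (uint64_t)TSC_MULT) >> TSC_SHIFT;
}

int main(void)
{
        /* ~1 second worth of cycles at ~2.8333GHz -> ~1000000000 ns */
        printf("%llu ns\n",
               (unsigned long long)tsc_delta_to_ns(2833300000ULL));
        return 0;
}
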


>> Intel does not publish, nor does the CPU come with in ROM or firmware,
>> the actual precise TSC frequency - this must be calibrated against the
>> other clocks , according to a complicated procedure in section 18.2 of
>> the SDM . My TSC has a "rated" / nominal TSC frequency , which one
>> can compute from CPUID leaves, of 2.3ghz, but the "Refined TSC frequency"
>> is 2.8333ghz .
>
> You might want to look at commit:
>
> b51120309348 ("x86/tsc: Fix erroneous TSC rate on Skylake Xeon")
>
> There is no such thing as a precise TSC frequency, there's a reason we
> have NTP/PTP.
>

By my reading of the SDM, yes, the TSC frequency IS fixed,
especially on modern CPUs with Always Running Timer (ART)
hardware; its precise value is simply not known in advance and
must be calibrated - but once calibrated, it stays fixed at
that value.

I want to count the number of ticks the CPU thinks have elapsed,
regardless of how fast or slow it is currently running, NOT the
NTP / PTP time.


ntpd itself is running very happily and is reporting MUCH lower
offsets and drift with the VDSO CLOCK_MONOTONIC_RAW
patch applied.

>> Hence I think Linux should export this calibrated frequency somehow ;
>> its "calibration" is expressed as the raw clocksource 'mult' and 'shift'
>> values, and is exported to the VDSO .
>>
>> I think the VDSO should read the TSC and use the calibration
>> to render the raw, unadjusted time from the CPU's perspective.
>>
>> Hence, the patch I am preparing , which is again attached.
>
> I have no objection to adding CLOCK_MONOTONIC_RAW support to the VDSO,
> but you seem to be rather confused on how things work.
>
> Now, if you wanted to actually have CLOCK_MONOTONIC_RAW times from perf
> you'd need something like the below patch.
>
> You'd need to create your events with:
>
> attr.use_clockid = 1;
> attr.clockid = CLOCK_MONOTONIC_RAW;
> attr.read_format |= PERF_FORMAT_TIME;
>

Yes, I am doing this, but I want to also be able to read
CLOCK_MONOTONIC_RAW using clock_gettime()
with low latency. I do not think this is an unreasonable desire
or an expectation not encouraged by the documentation.

> But whatever you do, you really have to stop mixing clocks, that's
> broken, even if it magically works for now.
>

That is precisely what I am trying to avoid.

When comparing the CPU-perspective CPU_CLOCK values
with any other time values, I want to make sure a low-latency,
non-NTP/PTP-adjusted value is used.

I hope someone other than me can see the sense in the above.

This is the latency I measure now with the timer_latency program
under 4.15.9 with the patch applied:
$ ./timer_latency
sum: 2313
Total time: 0.000002313S - Average Latency: 0.000000023S
$ ./timer_latency -m
sum: 5556
Total time: 0.000005556S - Average Latency: 0.000000055S


whereas before the patch was applied I measured latencies
of 300-1000ns for the CLOCK_MONOTONIC_RAW clock source.

This in itself to me makes the patch worth applying. I hope others
can see the sense in that.

Thanks & Best Regards,
Jason
/*
* Program to measure high-res timer latency.
*
*/
#include <stdint.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>
#include <time.h>
#include <errno.h>
#include <string.h>
#include <stdio.h>

#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)

int main(int argc, char *const *argv, char *const *envp)
{
        clockid_t clk = CLOCK_MONOTONIC_RAW;
        bool do_dump = false;
        int argn;
        for (argn = 1; argn < argc; argn += 1)
                if ((argv[argn] != NULL) && (*(argv[argn]) == '-'))
                        switch (*(argv[argn] + 1)) {
                        case 'm':
                        case 'M':
                                clk = CLOCK_MONOTONIC;
                                break;
                        case 'd':
                        case 'D':
                                do_dump = true;
                                break;
                        case '?':
                        case 'h':
                        case 'u':
                        case 'U':
                        case 'H':
                                fprintf(stderr, "Usage: timer_latency [ -m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW) ; -d : dump timespec contents ]. N_SAMPLES: " STR(N_SAMPLES) "\n\t"
                                        "Calculates average timer latency (minimum time that can be measured) over N_SAMPLES.\n"
                                );
                                return 0;
                        }
        /* take N_SAMPLES+1 back-to-back clock_gettime() samples */
        struct timespec sample[N_SAMPLES + 1];
        unsigned int cnt = N_SAMPLES, s = 0;
        do {
                if (0 != clock_gettime(clk, &sample[s++])) {
                        fprintf(stderr, "oops, clock_gettime() failed: %d: '%s'.\n", errno, strerror(errno));
                        return 1;
                }
        } while (--cnt);
        clock_gettime(clk, &sample[s]);
#define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec) * 1000000000ULL) + ((unsigned long long)((_TS_).tv_nsec)))
        unsigned long long
                deltas[N_SAMPLES]
                , t1, t2, sum = 0, zd = 0
                , t_start = TS2NS(sample[0]);
        /* compute the delta between each pair of consecutive samples */
        for (s = 1; s < (N_SAMPLES + 1); s += 1) {
                t1 = TS2NS(sample[s - 1]);
                t2 = TS2NS(sample[s]);
                if (t1 > t2) {
                        fprintf(stderr, "Inconsistency: %llu %llu %ld.%09ld %ld.%09ld\n", t1, t2
                                , sample[s - 1].tv_sec, sample[s - 1].tv_nsec
                                , sample[s].tv_sec, sample[s].tv_nsec
                        );
                        deltas[s - 1] = 0;
                        continue;
                }
                unsigned long long d = t2 - t1;
                if (d == 0) {
                        if (zd == 0)
                                fprintf(stderr, "0 delta!\n");
                        zd += 1;
                }
                deltas[s - 1] = d;
                if (do_dump)
                        fprintf(stderr, "%ld %ld %llu\n",
                                sample[s].tv_sec, sample[s].tv_nsec, d
                        );
        }
        if (zd > 0)
                fprintf(stderr, "%llu 0 deltas\n", zd);
        for (s = 0; s < N_SAMPLES; s += 1)
                sum += deltas[s];
        fprintf(stderr, "sum: %llu\n", sum);
        unsigned long long avg_ns = sum / N_SAMPLES;
        t1 = (t2 - t_start);
        printf("Total time: %1.1llu.%9.9lluS - Average Latency: %1.1llu.%9.9lluS\n",
               t1 / 1000000000, t1 % 1000000000,
               avg_ns / 1000000000, avg_ns % 1000000000
        );
        return 0;
}