Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW
From: Jason Vas Dias
Date: Tue Mar 13 2018 - 19:45:53 EST
On 12/03/2018, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, Mar 12, 2018 at 07:01:20AM +0000, Jason Vas Dias wrote:
>> Sometimes, particularly when correlating elapsed time to performance
>> counter values,
>
> So what actual problem are you tring to solve here? Perf can already
> give you sample time in various clocks, including MONOTONIC_RAW.
>
>
Yes, I am sampling perf counters, including CPU_CYCLES, INSTRUCTIONS,
CPU_CLOCK, TASK_CLOCK, etc., in a group FD I open with
perf_event_open(), for the current thread on the current CPU -
I am doing this for 4 threads, on Intel & ARM CPUs.
Reading the performance counters involves 2 ioctls and a read(),
which already takes far longer than reading the TSC or CNTPCT
in the VDSO.
The CPU_CLOCK software counter should give the converted TSC cycles
seen between the ioctl(grp_fd, PERF_EVENT_IOC_ENABLE, ...)
and the ioctl(grp_fd, PERF_EVENT_IOC_DISABLE, ...) calls, and the
difference between the event's time_running and time_enabled
should also measure elapsed time.
This gives the "inner" elapsed time, from the kernel's perspective,
while the measured code section had the counters enabled.
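For reference, this is the shape of the measurement I mean - a minimal
sketch using a single CPU_CLOCK software counter instead of my full
group FD, with error handling mostly omitted:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size        = sizeof(attr);
        attr.type        = PERF_TYPE_SOFTWARE;
        attr.config      = PERF_COUNT_SW_CPU_CLOCK;
        attr.disabled    = 1;
        attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED
                         | PERF_FORMAT_TOTAL_TIME_RUNNING;

        /* current thread, any CPU, no group leader */
        int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0)
                return 1;

        ioctl(fd, PERF_EVENT_IOC_RESET,  0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... measured code section ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        /* layout implied by the read_format flags above */
        struct { uint64_t value, time_enabled, time_running; } data;
        if (read(fd, &data, sizeof(data)) == sizeof(data))
                printf("value %llu enabled %lluns running %lluns\n",
                       (unsigned long long)data.value,
                       (unsigned long long)data.time_enabled,
                       (unsigned long long)data.time_running);
        close(fd);
        return 0;
}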
But unless the user-space program also has a way of measuring elapsed
time from the CPU's perspective, i.e. without being subject to operator
or NTP / PTP adjustment, it has no way of correlating this inner elapsed
time with any "outer" elapsed time measurement it may have made - I also
measure the time taken by I/O operations between threads, for instance.
So that is my primary motivation: for each thread's main run loop, I
enable performance counters and count several PMU events plus
CPU_CLOCK & TASK_CLOCK. I want to determine with maximal accuracy
how much elapsed time was spent actually executing the task's
instructions on the CPU, and how long they took to execute.
I want to exclude the time spent gathering and analysing the
performance measurements from the time spent running the threads'
main loop.
To do this accurately, it is best to exclude variations in time
that occur because of operator or NTP / PTP adjustments.
CLOCK_MONOTONIC_RAW is the ONLY clock that is MEANT to be
immune from any adjustment. It is meant to be a high-resolution
clock with 1ns resolution that is subject to no adjustment, and
hence one would expect it to have the lowest latency.
But as Linux has implemented it up to now, CLOCK_MONOTONIC_RAW
has a resolution (the minimum time that can be measured) that
varies from 300 to 1000ns.
I can read the TSC and store a 16-byte timespec value in about 8ns
on the same CPU.
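(By "read the TSC" I mean something like the sketch below, using the
compiler intrinsic; the result is raw cycles, which is exactly why the
kernel's calibration needs to be visible to user space:)

#include <stdint.h>
#include <x86intrin.h>          /* __rdtscp() */

/* Read the raw TSC; RDTSCP waits for prior instructions to complete
 * (unlike plain RDTSC) and returns IA32_TSC_AUX - typically the CPU
 * number - in 'aux', so a migration between two reads can be detected. */
static inline uint64_t read_tsc(unsigned int *aux)
{
        return __rdtscp(aux);
}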
I understand that Linux must conform to the POSIX interface, which
means it cannot provide sub-nanosecond resolution timers, but it could
allow user-space programs to easily discover the timer calibration so
that they can read the timers themselves.
Currently, users must parse the kernel log or use gdb / objdump to
inspect /proc/kcore to get the TSC calibration and the exact
mult & shift values used for the TSC conversion.
Intel does not publish the actual precise TSC frequency, nor does the
CPU provide it in ROM or firmware - it must be calibrated against the
other clocks, according to a complicated procedure in section 18.2 of
the SDM. My TSC has a "rated" / nominal frequency, which one can
compute from CPUID leaves, of 2.3GHz, but the "Refined TSC frequency"
is 2.8333GHz.
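(The nominal frequency I mean is the one enumerated by CPUID leaf 0x15 -
a rough sketch, assuming GCC / clang's <cpuid.h> and that the leaf is
populated on the CPU in question:)

#include <stdio.h>
#include <cpuid.h>              /* __get_cpuid() - GCC / clang */

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* Leaf 0x15: EAX = denominator and EBX = numerator of the
         * TSC / core-crystal-clock ratio; ECX = crystal frequency in Hz,
         * or 0 if the CPU does not enumerate it. */
        if (!__get_cpuid(0x15, &eax, &ebx, &ecx, &edx) || !eax || !ebx) {
                fprintf(stderr, "CPUID leaf 0x15 not usable on this CPU\n");
                return 1;
        }
        if (ecx)
                printf("nominal TSC frequency: %llu Hz\n",
                       (unsigned long long)ecx * ebx / eax);
        else
                printf("TSC / crystal ratio: %u / %u (crystal Hz not enumerated)\n",
                       ebx, eax);
        return 0;
}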
Hence I think Linux should export this calibrated frequency somehow;
the "calibration" is expressed as the raw clocksource 'mult' and 'shift'
values, which are already exported to the VDSO.
I think the VDSO should read the TSC and use that calibration
to render the raw, unadjusted time from the CPU's perspective.
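(Concretely, the conversion the kernel already does amounts to the
fixed-point multiply sketched below; if the calibrated 'mult' and
'shift' were visible to user space, a raw TSC delta could be turned
into nanoseconds directly. The parameters here are placeholders - only
the kernel knows the real calibrated values:)

#include <stdint.h>

/* Clocksource-style conversion of a TSC cycle delta to nanoseconds:
 *     ns = (cycles * mult) >> shift
 * 'mult' and 'shift' stand for whatever the kernel calibrated for the
 * TSC clocksource; fine for short deltas, since the 64-bit product can
 * overflow for very large cycle counts. */
static inline uint64_t tsc_delta_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
        return (cycles * mult) >> shift;
}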
Hence the patch I am preparing, which is again attached.
I will submit it properly via email once I figure out
how to obtain the 'git send-email' tool, and how to
use it to send multiple patches, which seems
to be the only way to submit acceptable patches.
Also, the attached timer program measures a latency
of about 20ns with my patched 4.15.9 kernel, where it
measured a latency of 300-1000ns without the patch.
Thanks & Regards,
Jason
Attachment:
vdso_clock_monotonic_raw_1.patch
Description: Binary data
/*
* Program to measure high-res timer latency.
*
*/
#include <stdint.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>
#include <time.h>
#include <errno.h>
#include <string.h>
#include <stdio.h>
#ifndef N_SAMPLES
#define N_SAMPLES 100
#endif
#define _STR(_S_) #_S_
#define STR(_S_) _STR(_S_)
int main(int argc, char *const* argv, char *const* envp)
{ clockid_t clk = CLOCK_MONOTONIC_RAW;
bool do_dump = false;
int argn=1;
for(; argn < argc; argn+=1)
if( argv[argn] != NULL )
if( *(argv[argn]) == '-')
switch( *(argv[argn]+1) )
{ case 'm':
case 'M':
clk = CLOCK_MONOTONIC;
break;
case 'd':
case 'D':
do_dump = true;
break;
case '?':
case 'h':
case 'u':
case 'U':
case 'H':
fprintf(stderr,"Usage: timer_latency [ -m : use CLOCK_MONOTONIC clock (not CLOCK_MONOTONIC_RAW) ; -d : dump timespec contents. N_SAMPLES: " STR(N_SAMPLES) "\n\t"
"Calculates average timer latency (minimum time that can be measured) over N_SAMPLES.\n"
);
return 0;
}
struct timespec sample[N_SAMPLES+1];
unsigned int cnt=N_SAMPLES, s=0 ;
do
{ if( 0 != clock_gettime(clk, &sample[s++]) )
{ fprintf(stderr,"oops, clock_gettime() failed: %d: '%s'.\n", errno, strerror(errno));
return 1;
}
}while( --cnt );
clock_gettime(clk, &sample[s]);
#define TS2NS(_TS_) ((((unsigned long long)(_TS_).tv_sec)*1000000000ULL) + (((unsigned long long)((_TS_).tv_nsec))))
unsigned long long
deltas [ N_SAMPLES ]
, t1, t2, sum=0, zd=0
, t_start = TS2NS(sample[0]);
for(s=1; s < (N_SAMPLES+1); s+=1)
{ t1 = TS2NS(sample[s-1]);
t2 = TS2NS(sample[s]);
if ( t1 > t2 )
{ fprintf(stderr,"Inconsistency: %llu %llu %ld.%09ld %ld.%09ld\n", t1 , t2
        , (long)sample[s-1].tv_sec, sample[s-1].tv_nsec
        , (long)sample[s].tv_sec, sample[s].tv_nsec
        );
  deltas[s-1] = 0;   /* do not leave this slot uninitialized */
  continue;
}
unsigned long long d =t2-t1;
if ( d == 0 )
{ if( zd == 0 )
fprintf(stderr, "0 delta!\n");
zd += 1;
}
deltas[s-1] = d;
}
if( zd > 0 )
        fprintf(stderr,"%llu 0 deltas\n", zd);
for(s = 0; s < N_SAMPLES; s+=1)
sum += deltas[s];
fprintf(stderr,"sum: %llu\n",sum);
unsigned long long avg_ns = sum / N_SAMPLES;
t1=(t2 - t_start);
printf("Total time: %1.1llu.%9.9lluS - Average Latency: %1.1llu.%9.9lluS\n",
t1 / 1000000000, t1 % 1000000000,
avg_ns / 1000000000, avg_ns % 1000000000
);
return 0;
}