Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609

From: Jason Vas Dias
Date: Wed Feb 22 2017 - 15:18:22 EST


I actually tried adding a 'notsc_adjust' kernel option to disable any setting of or
access to the TSC_ADJUST MSR, but then I see the problems - a big disparity
in TSC values depending on which CPU the thread is scheduled on - and no
improvement in clock_gettime() latency. So I don't think the new TSC_ADJUST
code in tsc_sync.c is itself the issue - rather, something added @ 460ns
onto every clock_gettime() call when moving from v4.8.0 -> v4.10.0.
As I don't think fixing the clock_gettime() latency is my problem, or even
possible with the current clock architecture, I will treat it as a non-issue.

But please, can anyone tell me if there are any plans to move the time
infrastructure out of the kernel and into glibc along the lines outlined
in my previous mail - if not, I am going to concentrate on this more radical
overhaul approach for my own systems.

At least, I think mapping the clocksource information structure itself into some
kind of shareable page makes sense. Processes could map that page copy-on-write
so they start off with all the timing parameters preloaded, then keep
their copy updated using the rdtscp instruction, or msync() (read-only)
against the kernel's single copy to get the latest time any process has requested.
All real-time parameters & adjustments could be stored in that page,
& eventually a single copy of the tzdata could be used by both kernel
& user-space.
That is what I am working towards. Any plans to make the Linux real-time TSC
clock user-friendly ?
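
As a concrete illustration, here is a minimal sketch of what such a shared
clocksource page and a lock-free user-space reader might look like. The
struct layout, the field names and the update protocol are purely
hypothetical - nothing like this exists in the kernel today:

  #include <stdint.h>

  /* hypothetical layout of the shared clocksource page */
  struct gtod_page {
          uint32_t seq;         /* seqcount: odd while the kernel updates it */
          uint32_t tsc_khz;     /* calibrated TSC frequency, kHz             */
          uint32_t mult, shift; /* cycles -> nanoseconds conversion factors  */
          uint64_t base_cycles; /* TSC value at the last kernel update       */
          uint64_t base_ns;     /* CLOCK_MONOTONIC_RAW ns at base_cycles     */
  };

  /* lock-free read: retry while the kernel is mid-update */
  static uint64_t read_raw_ns(const volatile struct gtod_page *gp, uint64_t tsc)
  {
          uint32_t seq;
          uint64_t ns;
          do {
                  seq = gp->seq;
                  ns  = gp->base_ns +
                        (((tsc - gp->base_cycles) * gp->mult) >> gp->shift);
          } while ((seq & 1) || seq != gp->seq);
          return ns;
  }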



On 22/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is
> read or written. It is probably because it genuinely does not
> support any CPUID leaf > 13, or the modern TSC_ADJUST interface.
> This is probably why my clock_gettime() latencies are so bad.
> Now I have to develop a patch to disable all access to the
> TSC_ADJUST MSR if boot_cpu_data.cpuid_level <= 13.
> I really have an unlucky CPU :-) .
>
> But really, I think this issue goes deeper, into the fundamental limits of
> time measurement on Linux: it is never going to be possible to measure
> minimum times with clock_gettime() comparable to those returned by the
> rdtscp instruction - the time taken to enter the kernel through the VDSO,
> queue an access to vsyscall_gtod_data via a workqueue, access it, do the
> computations & copy the value to user-space is NEVER going to be up to the
> job of measuring small real-time durations on the order of 10-20 TSC ticks.
>
> I think the best way to solve this problem going forward would be to store
> the entire vsyscall_gtod_data structure representing the current clocksource
> in a shared page which is memory-mappable (read-only) by user-space.
> I think user-space programs should be able to do something like:
>
>   /* includes needed: <fcntl.h>, <unistd.h>, <sys/mman.h> */
>   int fd = open("/sys/devices/system/clocksource/clocksource0/gtod.page",
>                 O_RDONLY);
>   size_t psz = getpagesize();
>   void *gtod = mmap(0, psz, PROT_READ, MAP_PRIVATE, fd, 0);
>   msync(gtod, psz, MS_SYNC);
>
> Then they could all read the real-time clock values as they are updated
> in real-time by the kernel, and know exactly how to interpret them.
>
> I also think that all mktime() / gmtime() / localtime() timezone handling
> functionality should be moved to user-space, and that the kernel should
> actually load and link in some /lib/libtzdata.so library, provided by
> glibc / libc implementations, that is exactly the same library used by
> glibc code to parse tzdata; tzdata should be loaded at boot time by the
> kernel from the same places glibc loads it, and both the kernel and glibc
> should use identical mktime(), gmtime(), etc. functions to access it, so
> code using glibc would not need to enter the kernel at all for any
> time-handling code. This tzdata-library code could be automatically loaded
> into process images the same way the vdso region is, and the whole system
> could access only one copy of it and the 'gtod.page' in memory.
>
> That's just my two cents' worth, and how I'd like to eventually get
> things working on my system.
>
> All the best, Regards,
> Jason
>
> On 22/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
>>> RE:
>>>>> 4.10 has new code which utilizes the TSC_ADJUST MSR.
>>>
>>> I just built an unpatched Linux v4.10 with tglx's TSC improvements -
>>> much else is improved in this kernel (like iwlwifi) - thanks!
>>>
>>> I have attached an updated version of the test program which
>>> doesn't print the bogus "Nominal TSC Frequency" (the previous
>>> version printed it, but equally ignored it).
>>>
>>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
>>> a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! :
>>>
>>> $ uname -r
>>> 4.10.0
>>> $ ./ttsc1
>>> max_extended_leaf: 80000008
>>> has tsc: 1 constant: 1
>>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
>>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
>>> ts3 - ts2: 178 ns1: 0.000000592
>>> ts3 - ts2: 14 ns1: 0.000000577
>>> ts3 - ts2: 14 ns1: 0.000000651
>>> ts3 - ts2: 17 ns1: 0.000000625
>>> ts3 - ts2: 17 ns1: 0.000000677
>>> ts3 - ts2: 17 ns1: 0.000000626
>>> ts3 - ts2: 17 ns1: 0.000000627
>>> ts3 - ts2: 17 ns1: 0.000000627
>>> ts3 - ts2: 18 ns1: 0.000000655
>>> ts3 - ts2: 17 ns1: 0.000000631
>>> t1 - t0: 89067 - ns2: 0.000091411
>>>
>>
>>
>> Oops, going blind in my old age. These latencies are actually about 4 times
>> greater than under 4.8 !!
>>
>> Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as
>> shown in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:
>>
>> ts3 - ts2: 24 ns1: 0.000000162
>> ts3 - ts2: 17 ns1: 0.000000143
>> ts3 - ts2: 17 ns1: 0.000000146
>> ts3 - ts2: 17 ns1: 0.000000149
>> ts3 - ts2: 17 ns1: 0.000000141
>> ts3 - ts2: 16 ns1: 0.000000142
>>
>> Now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @600ns,
>> about 4 times more than under 4.8.
>> But I'm glad the TSC_ADJUST problems are fixed.
>>
>> Will programs reading:
>> $ cat /sys/devices/msr/events/tsc
>> event=0x00
>> see a new event for each setting of the TSC_ADJUST MSR, or for a wrmsr
>> on the TSC ?
>>
>>> I think this is because under Linux 4.8, the CPU got a fault every
>>> time it read the TSC_ADJUST MSR.
>>
>> maybe it still does!
>>
>>
>>> But user programs that want to use the TSC and accurately correlate its
>>> value to clock_gettime(CLOCK_MONOTONIC_RAW) values, like the above
>>> program, still have to dig the TSC frequency value out of the kernel
>>> with objdump - this was really the point of bug #194609.
>>>
>>> I would still like to investigate exporting 'tsc_khz' & 'mult' +
>>> 'shift' values via sysfs.
>>>
>>> Regards,
>>> Jason.
>>>
>>>
>>>
>>>
>>>
>>> On 21/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
>>>> Thank You for enlightening me -
>>>>
>>>> I was just having a hard time believing that Intel would ship a chip
>>>> that features a monotonic, fixed-frequency timestamp counter
>>>> without specifying - in documentation, on-chip, or in ACPI - what
>>>> precisely that hard-wired frequency is, but I now know that to
>>>> be the case for the unfortunate i7-4910MQ. The CPU does assert
>>>> CPUID:80000007[8] ( InvariantTSC ), which is difficult to reconcile
>>>> with the statement in the SDM :
>>>> 17.16.4 Invariant Time-Keeping
>>>> The invariant TSC is based on the invariant timekeeping hardware
>>>> (called Always Running Timer or ART), that runs at the core crystal
>>>> clock frequency. The ratio defined by CPUID leaf 15H expresses the
>>>> frequency relationship between the ART hardware and TSC. If
>>>> CPUID.15H:EBX[31:0] != 0 and CPUID.80000007H:EDX[InvariantTSC] = 1,
>>>> the following linearity relationship holds between TSC and the ART
>>>> hardware:
>>>>
>>>>   TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K
>>>>
>>>> Where 'K' is an offset that can be adjusted by a privileged agent.
>>>> When ART hardware is reset, both invariant TSC and K are also reset.
>>>>
>>>> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and
>>>> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly)
>>>> that the "Nominal TSC Frequency" formula in the manual must apply to
>>>> all CPUs with InvariantTSC .
>>>>
>>>> Do I understand correctly that, since I do have InvariantTSC, the
>>>> TSC_Value is in fact calculated according to the above formula, but
>>>> with a "hidden" ART value, core crystal clock frequency & ratio to the
>>>> TSC frequency ?
>>>> It was obvious this nominal TSC frequency had nothing to do with the
>>>> actual TSC frequency used by Linux, which is 'tsc_khz'.
>>>> I guess wishful thinking led me to believe CPUID:15h was actually
>>>> supported somehow, because I thought InvariantTSC meant it had ART
>>>> hardware.
>>>>
>>>> I do strongly suggest that Linux export its calibrated TSC kHz
>>>> somewhere to user space.
>>>>
>>>> I think the best long-term solution would be to allow programs to
>>>> somehow read the TSC without invoking
>>>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts) & having to enter the kernel,
>>>> which incurs an overhead of > 120ns on my system.
>>>>
>>>>
>>>> Couldn't Linux export its 'tsc_khz' and/or 'clocksource->mult' and
>>>> 'clocksource->shift' values via sysfs somehow ?
>>>>
>>>> For instance, only when the 'current_clocksource' is 'tsc', these
>>>> values could be exported as:
>>>> /sys/devices/system/clocksource/clocksource0/shift
>>>> /sys/devices/system/clocksource/clocksource0/mult
>>>> /sys/devices/system/clocksource/clocksource0/freq
>>>>
>>>> So user-space programs could know that the value returned by
>>>> clock_gettime(CLOCK_MONOTONIC_RAW) would be, in effect,
>>>>   ns = ( rdtsc() * mult ) >> shift;
>>>>   { .tv_sec = ns / 1000000000, .tv_nsec = ns % 1000000000 }
>>>> and that represents ticks of period (1.0 / ( freq * 1000 )) seconds.
>>>>
>>>> That would save user-space programs from having to know 'tsc_khz' by
>>>> parsing the 'Refined TSC' frequency from log files or by examining the
>>>> running kernel with objdump to obtain this value & figure out 'mult' &
>>>> 'shift' themselves.
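>>>>
>>>> For example, assuming such 'mult' and 'shift' files existed (they are
>>>> only proposed here, they do not exist in current kernels), a user-space
>>>> consumer could look roughly like this sketch:
>>>>
>>>>   #include <stdio.h>
>>>>   #include <stdint.h>
>>>>   #include <x86intrin.h>          /* __rdtsc() */
>>>>
>>>>   static unsigned read_u32(const char *path)
>>>>   { unsigned v = 0;
>>>>     FILE *f = fopen(path, "r");
>>>>     if (f) { fscanf(f, "%u", &v); fclose(f); }
>>>>     return v;
>>>>   }
>>>>
>>>>   int main(void)
>>>>   { const char *d = "/sys/devices/system/clocksource/clocksource0/";
>>>>     char p[128];
>>>>     snprintf(p, sizeof p, "%smult",  d); unsigned mult  = read_u32(p);
>>>>     snprintf(p, sizeof p, "%sshift", d); unsigned shift = read_u32(p);
>>>>     uint64_t t0 = __rdtsc(), t1 = __rdtsc();
>>>>     /* elapsed ns between two TSC reads, using the kernel's factors */
>>>>     printf("delta: %llu ns\n",
>>>>            (unsigned long long)(((t1 - t0) * (uint64_t)mult) >> shift));
>>>>     return 0;
>>>>   }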
>>>>
>>>> And why not a
>>>> /sys/devices/system/clocksource/clocksource0/value
>>>> file that actually prints this ( ( rdtsc() * mult ) >> shift )
>>>> expression as a long integer?
>>>> And perhaps a
>>>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
>>>> file that actually prints out the number of real-time nanoseconds since
>>>> the time given by the contents of the existing
>>>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
>>>> files, computed using the current TSC value?
>>>> For scripts, reading the rtc0/{date,time} files is already faster than
>>>> entering the kernel to call clock_gettime(CLOCK_REALTIME, &ts) and
>>>> converting to an integer.
>>>>
>>>> I will work on developing a patch to this effect if no-one else is.
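>>>>
>>>> For what it's worth, a first rough sketch of the kernel side might look
>>>> something like the following - untested, and the attribute name and the
>>>> hookup point next to the existing clocksource sysfs attributes are only
>>>> suggestions:
>>>>
>>>>   /* sketch: expose the calibrated tsc_khz as
>>>>    * /sys/devices/system/clocksource/clocksource0/freq */
>>>>   static ssize_t freq_show(struct device *dev,
>>>>                            struct device_attribute *attr, char *buf)
>>>>   {
>>>>           return sprintf(buf, "%u\n", tsc_khz);
>>>>   }
>>>>   static DEVICE_ATTR_RO(freq);
>>>>
>>>>   /* ... registered from the clocksource sysfs init code, e.g.:
>>>>    *   device_create_file(&device_clocksource, &dev_attr_freq);
>>>>    */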
>>>>
>>>> Also, am I right in assuming that the maximum granularity of the
>>>> real-time clock on my system is 1/64th of a second? :
>>>> $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
>>>> 64
>>>> That is the maximum granularity that can be stored in CMOS, not the
>>>> granularity returned by the TSC? Couldn't we have something similar
>>>> that gave an accurate idea of the TSC frequency and the precise formula
>>>> applied to the TSC value to get the clock_gettime(CLOCK_MONOTONIC_RAW)
>>>> value ?
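>>>>
>>>> (Back-of-the-envelope arithmetic, just for comparison: max_user_freq =
>>>> 64 means the RTC can deliver periodic interrupts at no more than 64 Hz,
>>>> i.e. a granularity of 1/64 s = 15.625 ms, whereas one tick of the
>>>> 2.893299 GHz TSC is 1 / 2.893299e9 s ~= 0.346 ns - roughly 4.5e7 times
>>>> finer.)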
>>>>
>>>> Regards,
>>>> Jason
>>>>
>>>>
>>>> This code does produce good timestamps with a latency of @20ns
>>>> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
>>>> values, but it depends on a global variable that is initialized to
>>>> the 'tsc_khz' value computed by the running kernel, parsed from
>>>> objdump /proc/kcore output :
>>>>
>>>> static inline __attribute__((always_inline))
>>>> U64_t
>>>> IA64_tsc_now()
>>>> { if(!( _ia64_invariant_tsc_enabled
>>>>       ||(( _cpu0id_fd == -1) && IA64_invariant_tsc_is_enabled(NULL,NULL))
>>>>       )
>>>>   )
>>>>   { fprintf(stderr, __FILE__":%d:(%s): must be called with invariant"
>>>>                     " TSC enabled.\n", __LINE__, __func__);
>>>>     return 0;
>>>>   }
>>>>   U32_t tsc_hi, tsc_lo;
>>>>   register UL_t tsc;
>>>>   asm volatile
>>>>   ( "rdtscp\n\t"
>>>>     "mov %%edx, %0\n\t"
>>>>     "mov %%eax, %1\n\t"
>>>>     "mov %%ecx, %2\n\t"
>>>>     : "=m" (tsc_hi) ,
>>>>       "=m" (tsc_lo) ,
>>>>       "=m" (_ia64_tsc_user_cpu)
>>>>     :
>>>>     : "%eax","%ecx","%edx"
>>>>   );
>>>>   tsc = (((UL_t)tsc_hi) << 32) | ((UL_t)tsc_lo);
>>>>   return tsc;
>>>> }
>>>>
>>>> __thread
>>>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>>>>
>>>> static inline __attribute__((always_inline))
>>>> U64_t IA64_tsc_ticks_since_start()
>>>> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>>>>   { _ia64_first_tsc = IA64_tsc_now();
>>>>     return 0;
>>>>   }
>>>>   return (IA64_tsc_now() - _ia64_first_tsc);
>>>> }
>>>>
>>>> static inline __attribute__((always_inline))
>>>> void
>>>> ia64_tsc_calc_mult_shift
>>>> ( register U32_t *mult,
>>>> register U32_t *shift
>>>> )
>>>> { /* Paraphrases Linux clocksource.c's clocks_calc_mult_shift():
>>>>    * calculates the second + nanosecond mult + shift in the same way
>>>>    * Linux does.  We want to be compatible with what Linux returns in
>>>>    * struct timespec ts after a call to
>>>>    * clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>>>>    */
>>>>   const U32_t scale = 1000U;
>>>>   register U32_t from = IA64_tsc_khz();
>>>>   register U32_t to   = NSEC_PER_SEC / scale;
>>>>   register U64_t sec  = ( ~0UL / from ) / scale;
>>>>   sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>>>>   register U64_t maxsec = sec * scale;
>>>>   UL_t tmp;
>>>>   U32_t sft, sftacc = 32;
>>>>   /*
>>>>    * Calculate the shift factor which is limiting the conversion
>>>>    * range:
>>>>    */
>>>>   tmp = (maxsec * from) >> 32;
>>>>   while (tmp)
>>>>   { tmp >>= 1;
>>>>     sftacc--;
>>>>   }
>>>>   /*
>>>>    * Find the conversion shift/mult pair which has the best
>>>>    * accuracy and fits the maxsec conversion range:
>>>>    */
>>>>   for (sft = 32; sft > 0; sft--)
>>>>   { tmp = ((UL_t) to) << sft;
>>>>     tmp += from / 2;
>>>>     tmp = tmp / from;
>>>>     if ((tmp >> sftacc) == 0)
>>>>       break;
>>>>   }
>>>>   *mult = tmp;
>>>>   *shift = sft;
>>>> }
>>>>
>>>> __thread
>>>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;
>>>>
>>>> static inline __attribute__((always_inline))
>>>> U64_t IA64_s_ns_since_start()
>>>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>>>>     ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
>>>>   register U64_t cycles = IA64_tsc_ticks_since_start();
>>>>   register U64_t ns =
>>>>     ((cycles * ((UL_t)_ia64_tsc_mult)) >> _ia64_tsc_shift);
>>>>   return( (((ns / NSEC_PER_SEC) & 0xffffffffUL) << 32)
>>>>         | ((ns % NSEC_PER_SEC) & 0x3fffffffUL) );
>>>>   /* Yes, we are purposefully ignoring durations of more than
>>>>      4.2 billion seconds here! */
>>>> }
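>>>>
>>>> As a usage note, the packed value would be consumed roughly like this
>>>> (sketch only, assuming <stdio.h> is included and the helpers above are
>>>> in scope):
>>>>
>>>>   /* seconds are in the upper 32 bits, nanoseconds in the low 30 bits */
>>>>   U64_t packed = IA64_s_ns_since_start();
>>>>   U64_t sec  = packed >> 32;
>>>>   U64_t nsec = packed & 0x3fffffffUL;
>>>>   printf("%llu.%09llu seconds since start\n",
>>>>          (unsigned long long)sec, (unsigned long long)nsec);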
>>>>
>>>>
>>>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
>>>> somehow; then user-space libraries could have more confidence in using
>>>> 'rdtsc' or 'rdtscp' if Linux's current_clocksource is 'tsc'.
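>>>>
>>>> Checking that condition from user-space is already possible today;
>>>> something along these lines (minimal sketch, no error handling):
>>>>
>>>>   #include <stdio.h>
>>>>   #include <string.h>
>>>>
>>>>   /* only trust rdtsc/rdtscp-based timing if the kernel itself is
>>>>    * currently using the TSC as its clocksource */
>>>>   static int kernel_uses_tsc(void)
>>>>   { char buf[32] = "";
>>>>     FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/"
>>>>                     "current_clocksource", "r");
>>>>     if (f) { fgets(buf, sizeof buf, f); fclose(f); }
>>>>     return strncmp(buf, "tsc", 3) == 0;
>>>>   }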
>>>>
>>>> Regards,
>>>> Jason
>>>>
>>>>
>>>>
>>>> On 20/02/2017, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>>>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>>>>
>>>>>> CPUID:15H is available in user-space, returning the integers (7,
>>>>>> 832, 832) in EAX:EBX:ECX, yet boot_cpu_data.cpuid_level is 13, so
>>>>>> in detect_art() in tsc.c,
>>>>>
>>>>> By some definition of available. You can feed CPUID random leaf
>>>>> numbers and it will return something, usually the value of the last
>>>>> valid CPUID leaf, which is 13 on your CPU. A similar CPU model has
>>>>>
>>>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340
>>>>> edx=0x00000000
>>>>>
>>>>> i.e. 7, 832, 832, 0
>>>>>
>>>>> Looks familiar, right?
>>>>>
>>>>> You can verify that with 'cpuid -1 -r' on your machine.
>>>>>
>>>>>> Linux does not think ART is enabled, and does not set the synthesized
>>>>>> CPUID + ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid
>>>>>> would not see this bit set.
>>>>>
>>>>> Rightfully so. This is a Haswell Core model.
>>>>>
>>>>>> if an e1000 NIC card had been installed, PTP would not be available.
>>>>>
>>>>> PTP is independent of the ART kernel feature. ART just provides
>>>>> enhanced PTP features. You are confusing things here.
>>>>>
>>>>> The ART feature as the kernel sees it is a hardware extension which
>>>>> feeds the ART clock to peripherals for timestamping and time
>>>>> correlation purposes. The ratio between ART and TSC is described by
>>>>> CPUID leaf 0x15 so the kernel can make use of that correlation,
>>>>> e.g. for enhanced PTP accuracy.
>>>>>
>>>>> It's correct that the NONSTOP_TSC feature depends on the availability
>>>>> of ART, but that has nothing to do with the feature bit, which solely
>>>>> describes the ratio between TSC and the ART frequency which is exposed
>>>>> to peripherals. That frequency is not necessarily the real ART
>>>>> frequency.
>>>>>
>>>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to
>>>>>> be nowhere else in Linux, the code will always think X86_FEATURE_ART
>>>>>> is 0, because the CPU will always get a fault reading the MSR since
>>>>>> it has never been written.
>>>>>
>>>>> Huch? If an access to the TSC ADJUST MSR faults, then something is
>>>>> really wrong. And writing it unconditionally to 0 is not going to
>>>>> happen. 4.10 has new code which utilizes the TSC_ADJUST MSR.
>>>>>
>>>>>> It would be nice if user-space programs that want to use the TSC with
>>>>>> the rdtsc / rdtscp instructions, such as the demo program attached to
>>>>>> the bug report, could have confidence that Linux is actually
>>>>>> generating the results of clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>>>>>> in a predictable way from the TSC, by looking at the
>>>>>> /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
>>>>>> use of TSC values, so that they can correlate TSC values with Linux
>>>>>> clock_gettime() values.
>>>>>
>>>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>>>>
>>>>> Nothing at all, really.
>>>>>
>>>>> The kernel makes use of the proper information values already.
>>>>>
>>>>> The TSC frequency is determined from:
>>>>>
>>>>> 1) CPUID(0x16) if available
>>>>> 2) MSRs if available
>>>>> 3) By calibration against a known clock
>>>>>
>>>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_*
>>>>> values are correct whether that machine has ART exposed to
>>>>> peripherals or not.
>>>>>
>>>>>> has tsc: 1 constant: 1
>>>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>>>>
>>>>> And that voodoo math tells us what? That you found a way to correlate
>>>>> CPUID(0xd) to the TSC frequency on that machine.
>>>>>
>>>>> Now I'm curious how you do that on this other machine which returns
>>>>> for cpuid(15): 1, 1, 1
>>>>>
>>>>> You can't because all of this is completely wrong.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> tglx
>>>>>
>>>>
>>>
>>
>