Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609

From: Jason Vas Dias
Date: Wed Feb 22 2017 - 11:26:51 EST


On 22/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
> RE:
>>> 4.10 has new code which utilizes the TSC_ADJUST MSR.
>
> I just built an unpatched linux v4.10 with tglx's TSC improvements -
> much else improved in this kernel (like iwlwifi) - thanks!
>
> I have attached an updated version of the test program which
> doesn't print the bogus "Nominal TSC Frequency" (the previous
> version printed it, but equally ignored it).
>
> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
> a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! :
>
> $ uname -r
> 4.10.0
> $ ./ttsc1
> max_extended_leaf: 80000008
> has tsc: 1 constant: 1
> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
> ts3 - ts2: 178 ns1: 0.000000592
> ts3 - ts2: 14 ns1: 0.000000577
> ts3 - ts2: 14 ns1: 0.000000651
> ts3 - ts2: 17 ns1: 0.000000625
> ts3 - ts2: 17 ns1: 0.000000677
> ts3 - ts2: 17 ns1: 0.000000626
> ts3 - ts2: 17 ns1: 0.000000627
> ts3 - ts2: 17 ns1: 0.000000627
> ts3 - ts2: 18 ns1: 0.000000655
> ts3 - ts2: 17 ns1: 0.000000631
> t1 - t0: 89067 - ns2: 0.000091411
>


Oops, going blind in my old age. These latencies are actually about 4 times
greater than under 4.8 !!

Under 4.8, the program printed latencies of @ 140ns for clock_gettime, as shown
in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:

ts3 - ts2: 24 ns1: 0.000000162
ts3 - ts2: 17 ns1: 0.000000143
ts3 - ts2: 17 ns1: 0.000000146
ts3 - ts2: 17 ns1: 0.000000149
ts3 - ts2: 17 ns1: 0.000000141
ts3 - ts2: 16 ns1: 0.000000142

Now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ 600ns,
about 4 times more than under 4.8.
But I'm glad the TSC_ADJUST problems are fixed.
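
For anyone wanting to reproduce the 'ns1' figure without the attached test program, here is a minimal sketch of how such a latency could be measured: two back-to-back clock_gettime(CLOCK_MONOTONIC_RAW) calls, keeping the smallest gap seen. The function names are made up for this illustration and are not taken from the ttsc1 program.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from *a to *b. */
static int64_t ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
    return (int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (int64_t)(b->tv_nsec - a->tv_nsec);
}

/*
 * Smallest observed gap between two back-to-back clock_gettime() calls.
 * Taking the minimum filters out interrupts and cache misses, so the
 * result approximates the cost of a single call.
 */
static int64_t min_clock_gettime_latency_ns(int iterations)
{
    int64_t best = INT64_MAX;
    for (int i = 0; i < iterations; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC_RAW, &a);
        clock_gettime(CLOCK_MONOTONIC_RAW, &b);
        int64_t d = ts_diff_ns(&a, &b);
        if (d < best)
            best = d;
    }
    return best;
}
```

Calling min_clock_gettime_latency_ns(1000) and printing the result gives a number comparable to the 'ns1' column above, although the exact value depends on the kernel and the clocksource in use.
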

Will programs reading the perf 'tsc' event:
$ cat /sys/devices/msr/events/tsc
event=0x00
see a new event each time the TSC_ADJUST MSR is set or the TSC is written
with wrmsr?

> I think this is because under Linux 4.8, the CPU got a fault every
> time it read the TSC_ADJUST MSR.

maybe it still is!


> But user programs wanting to use the TSC and correlate its value to
> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above
> program still have to dig the TSC frequency value out of the kernel
> with objdump - this was really the point of the bug #194609.
>
> I would still like to investigate exporting 'tsc_khz' & 'mult' +
> 'shift' values via sysfs.
>
> Regards,
> Jason.
>
>
>
>
>
> On 21/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
>> Thank You for enlightening me -
>>
>> I was just having a hard time believing that Intel would ship a chip
>> that features a monotonic, fixed frequency timestamp counter
>> without specifying in either documentation or on-chip or in ACPI what
>> precisely that hard-wired frequency is, but I now know that to
>> be the case for the unfortunate i7-4910MQ. The CPU does assert
>> CPUID:80000007[8] ( InvariantTSC ), which is
>> difficult to reconcile with the statement in the SDM :
>> 17.16.4 Invariant Time-Keeping
>> The invariant TSC is based on the invariant timekeeping hardware
>> (called Always Running Timer or ART), that runs at the core crystal
>> clock frequency. The ratio defined by CPUID leaf 15H expresses the
>> frequency relationship between the ART hardware and TSC. If
>> CPUID.15H:EBX[31:0] != 0 and CPUID.80000007H:EDX[InvariantTSC] = 1,
>> the following linearity relationship holds between TSC and the ART
>> hardware:
>> TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K
>> Where 'K' is an offset that can be adjusted by a privileged agent*2.
>> When ART hardware is reset, both invariant TSC and K are also reset.
>>
>> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and
>> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly)
>> that the "Nominal TSC Frequency" formulae in the manual must apply
>> to all CPUs with InvariantTSC .
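
To make the SDM relation concrete, here is a small sketch of the ART-to-TSC conversion for a CPU that really does enumerate leaf 15H. The function and parameter names are invented for this example; 'numerator' stands for CPUID.15H:EBX and 'denominator' for CPUID.15H:EAX. It does not apply to CPUs such as the i7-4910MQ, where the leaf is not valid.

```c
#include <stdint.h>

/*
 * Sketch of the SDM relation TSC = ART * EBX / EAX + K.
 * numerator   : CPUID.15H:EBX (hypothetical parameter name)
 * denominator : CPUID.15H:EAX (hypothetical parameter name)
 * Returns 0 and stores the result in *tsc on success, or -1 when the
 * ratio is not enumerated (numerator or denominator is zero).
 */
static int art_to_tsc(uint64_t art, uint32_t numerator, uint32_t denominator,
                      uint64_t k, uint64_t *tsc)
{
    if (numerator == 0 || denominator == 0)
        return -1;
    /* Split the division so the intermediate product stays small. */
    uint64_t quot = art / denominator;
    uint64_t rem  = art % denominator;
    *tsc = quot * numerator + (rem * numerator) / denominator + k;
    return 0;
}
```

Splitting into quotient and remainder gives the same result as floor(art * numerator / denominator) without needing a 128-bit multiply.
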
>>
>> Do I understand correctly , that since I do have InvariantTSC , the
>> TSC_Value is in fact calculated according to the above formula, but with
>> a "hidden" ART Value, & Core Crystal Clock frequency & its ratio to
>> TSC frequency ?
>> It was obvious this nominal TSC Frequency had nothing to do with the
>> actual TSC frequency used by Linux, which is 'tsc_khz' .
>> I guess wishful thinking led me to believe CPUID:15h was actually
>> supported somehow , because I thought InvariantTSC meant it had ART
>> hardware .
>>
>> I do strongly suggest that Linux exports its calibrated TSC kHz
>> somewhere to user space .
>>
>> I think the best long-term solution would be to allow programs to
>> somehow read the TSC without invoking
>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts) and
>> having to enter the kernel, which incurs an overhead of > 120ns on my
>> system.
>>
>>
>> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
>> 'clocksource->shift' values via sysfs somehow ?
>>
>> For instance , only if the 'current_clocksource' is 'tsc', then these
>> values could be exported as:
>> /sys/devices/system/clocksource/clocksource0/shift
>> /sys/devices/system/clocksource/clocksource0/mult
>> /sys/devices/system/clocksource/clocksource0/freq
>>
>> So user-space programs could know that the value returned by
>> clock_gettime(CLOCK_MONOTONIC_RAW)
>> would be
>> { .tv_sec = ( ( rdtsc() * mult ) >> shift ) / NSEC_PER_SEC
>> , .tv_nsec = ( ( rdtsc() * mult ) >> shift ) % NSEC_PER_SEC
>> }
>> where rdtsc() counts ticks of period (1.0 / ( freq * 1000 )) s, with freq in kHz.
>>
>> That would save user-space programs from having to know 'tsc_khz' by
>> parsing the 'Refined TSC' frequency from log files or by examining the
>> running kernel with objdump to obtain this value & figure out 'mult' &
>> 'shift' themselves.
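
A minimal sketch of the scaling step discussed above, assuming 'mult' and 'shift' were obtained somehow (they are not exported today, which is the point of this thread). The helper names are invented here. The kernel additionally applies a base offset and works on the cycle delta since its last update, so this shows only the cycles-to-nanoseconds arithmetic, not the full clock_gettime() computation. It relies on the GCC/Clang 'unsigned __int128' extension on 64-bit targets to avoid overflow in the multiply.

```c
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* ns = (cycles * mult) >> shift, using a 128-bit intermediate product. */
static uint64_t cycles_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    return (uint64_t)(((unsigned __int128)cycles * mult) >> shift);
}

/* Split a nanosecond count into seconds and nanoseconds. */
static struct timespec ns_to_timespec(uint64_t ns)
{
    struct timespec ts;
    ts.tv_sec  = (time_t)(ns / NSEC_PER_SEC);
    ts.tv_nsec = (long)(ns % NSEC_PER_SEC);
    return ts;
}
```

With mult = 2^31 and shift = 32 each cycle counts as half a nanosecond, so 3,000,000,000 cycles convert to 1.5 seconds.
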
>>
>> And why not a
>> /sys/devices/system/clocksource/clocksource0/value
>> file that actually prints this ( ( rdtsc() * mult ) >> shift )
>> expression as a long integer?
>> And perhaps a
>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
>> file that actually prints out the number of real-time nano-seconds
>> since the contents of the existing
>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
>> files using the current TSC value?
>> To read the rtc0/{date,time} files is already faster than entering the
>> kernel to call
>> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts.
>>
>> I will work on developing a patch to this effect if no-one else is.
>>
>> Also, am I right in assuming that the maximum granularity of the
>> real-time clock on my system is 1/64th of a second ? :
>> $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
>> 64
>> This is the maximum granularity that can be stored in CMOS , not
>> returned by TSC? Couldn't we have something similar that gave an
>> accurate idea of TSC frequency and the precise formula applied to TSC
>> value to get clock_gettime
>> (CLOCK_MONOTONIC_RAW) value ?
>>
>> Regards,
>> Jason
>>
>>
>> This code does produce good timestamps with a latency of @20ns
>> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
>> values, but it depends on a global variable that is initialized to
>> the 'tsc_khz' value computed by the running kernel, parsed from
>> objdump /proc/kcore output :
>>
>> static inline __attribute__((always_inline))
>> U64_t
>> IA64_tsc_now()
>> { if(!( _ia64_invariant_tsc_enabled
>> ||(( _cpu0id_fd == -1) && IA64_invariant_tsc_is_enabled(NULL,NULL))
>> )
>> )
>> { fprintf(stderr, "%s:%d:(%s): must be called with invariant TSC enabled.\n",
>> __FILE__, __LINE__, __func__);
>> return 0;
>> }
>> U32_t tsc_hi, tsc_lo;
>> register UL_t tsc;
>> asm volatile
>> ( "rdtscp\n\t"
>> "mov %%edx, %0\n\t"
>> "mov %%eax, %1\n\t"
>> "mov %%ecx, %2\n\t"
>> : "=m" (tsc_hi) ,
>> "=m" (tsc_lo) ,
>> "=m" (_ia64_tsc_user_cpu) :
>> : "%eax","%ecx","%edx"
>> );
>> tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
>> return tsc;
>> }
>>
>> __thread
>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>>
>> static inline __attribute__((always_inline))
>> U64_t IA64_tsc_ticks_since_start()
>> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>> { _ia64_first_tsc = IA64_tsc_now();
>> return 0;
>> }
>> return (IA64_tsc_now() - _ia64_first_tsc) ;
>> }
>>
>> static inline __attribute__((always_inline))
>> void
>> ia64_tsc_calc_mult_shift
>> ( register U32_t *mult,
>> register U32_t *shift
>> )
>> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() function:
>> * calculates second + nanosecond mult + shift in same way linux does.
>> * we want to be compatible with what linux returns in struct timespec ts
>> * after call to clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>> */
>> const U32_t scale=1000U;
>> register U32_t from= IA64_tsc_khz();
>> register U32_t to = NSEC_PER_SEC / scale;
>> register U64_t sec = ( ~0UL / from ) / scale;
>> sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>> register U64_t maxsec = sec * scale;
>> UL_t tmp;
>> U32_t sft, sftacc=32;
>> /*
>> * Calculate the shift factor which is limiting the conversion
>> * range:
>> */
>> tmp = (maxsec * from) >> 32;
>> while (tmp)
>> { tmp >>=1;
>> sftacc--;
>> }
>> /*
>> * Find the conversion shift/mult pair which has the best
>> * accuracy and fits the maxsec conversion range:
>> */
>> for (sft = 32; sft > 0; sft--)
>> { tmp = ((UL_t) to) << sft;
>> tmp += from / 2;
>> tmp = tmp / from;
>> if ((tmp >> sftacc) == 0)
>> break;
>> }
>> *mult = tmp;
>> *shift = sft;
>> }
>>
>> __thread
>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;
>>
>> static inline __attribute__((always_inline))
>> U64_t IA64_s_ns_since_start()
>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>> ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
>> register U64_t cycles = IA64_tsc_ticks_since_start();
>> register U64_t ns = ((cycles * ((UL_t)_ia64_tsc_mult)) >> _ia64_tsc_shift);
>> return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32) | ((ns % NSEC_PER_SEC)&0x3fffffffUL) );
>> /* Yes, we are purposefully ignoring durations of more than 4.2 billion seconds here! */
>> }
>>
>>
>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
>> somehow; then user-space libraries could have more confidence in using
>> 'rdtsc' or 'rdtscp' if Linux's current_clocksource is 'tsc'.
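
The "current_clocksource is 'tsc'" precondition can at least be checked from user space today. A rough sketch, with invented function names, that reads the existing sysfs file and compares the name exactly (so that a name merely starting with "tsc" does not match):

```c
#include <stdio.h>
#include <string.h>

/* True only if the line is exactly "tsc", with or without a newline. */
static int name_is_tsc(const char *line)
{
    size_t n = strcspn(line, "\n");
    return n == 3 && strncmp(line, "tsc", 3) == 0;
}

/* Returns 1 if the kernel's clocksource is tsc, 0 if not, -1 on error. */
static int current_clocksource_is_tsc(void)
{
    char buf[64];
    FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/"
                    "current_clocksource", "r");
    if (!f)
        return -1;
    char *ok = fgets(buf, sizeof buf, f);
    fclose(f);
    if (!ok)
        return -1;
    return name_is_tsc(buf);
}
```

A library would presumably fall back to clock_gettime() whenever this returns anything other than 1.
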
>>
>> Regards,
>> Jason
>>
>>
>>
>> On 20/02/2017, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>>
>>>> CPUID:15H is available in user-space, returning the integers : ( 7,
>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>>>> in detect_art() in tsc.c,
>>>
>>> By some definition of available. You can feed CPUID random leaf numbers
>>> and it will return something, usually the value of the last valid CPUID
>>> leaf, which is 13 on your CPU. A similar CPU model has
>>>
>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 edx=0x00000000
>>>
>>> i.e. 7, 832, 832, 0
>>>
>>> Looks familiar, right?
>>>
>>> You can verify that with 'cpuid -1 -r' on your machine.
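
The point above, that a leaf number has to be checked against the maximum supported leaf before its output means anything, can be sketched as a small range check. The function name is made up for this example; max_basic would come from CPUID.0:EAX and max_extended from CPUID.80000000H:EAX.

```c
#include <stdint.h>

/*
 * Returns 1 if 'leaf' lies within the range the CPU reports as valid.
 * max_basic    : CPUID.0:EAX          (13 on the CPU discussed here)
 * max_extended : CPUID.80000000H:EAX  (0x80000008 on that CPU)
 */
static int cpuid_leaf_is_valid(uint32_t leaf, uint32_t max_basic,
                               uint32_t max_extended)
{
    if (leaf >= 0x80000000u)
        return max_extended >= 0x80000000u && leaf <= max_extended;
    return leaf <= max_basic;
}
```

With the values from this thread, cpuid_leaf_is_valid(0x15, 13, 0x80000008) is 0, so whatever CPUID returns for leaf 15H there must not be interpreted. With GCC or Clang, the __get_cpuid_max() helper in <cpuid.h> can supply the two maxima.
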
>>>
>>>> Linux does not think ART is enabled, and does not set the synthesized
>>>> CPUID + ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid
>>>> would not see this bit set .
>>>
>>> Rightfully so. This is a Haswell Core model.
>>>
>>>> if an e1000 NIC card had been installed, PTP would not be available.
>>>
>>> PTP is independent of the ART kernel feature . ART just provides
>>> enhanced PTP features. You are confusing things here.
>>>
>>> The ART feature as the kernel sees it is a hardware extension which
>>> feeds the ART clock to peripherals for timestamping and time correlation
>>> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15
>>> so the kernel can make use of that correlation, e.g. for enhanced PTP
>>> accuracy.
>>>
>>> It's correct, that the NONSTOP_TSC feature depends on the availability
>>> of ART, but that has nothing to do with the feature bit, which solely
>>> describes the ratio between TSC and the ART frequency which is exposed
>>> to peripherals. That frequency is not necessarily the real ART frequency.
>>>
>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be
>>>> nowhere else in Linux, the code will always think X86_FEATURE_ART is 0
>>>> because the CPU will always get a fault reading the MSR since it has
>>>> never been written.
>>>
>>> Huch? If an access to the TSC ADJUST MSR faults, then something is
>>> really wrong. And writing it unconditionally to 0 is not going to
>>> happen. 4.10 has new code which utilizes the TSC_ADJUST MSR.
>>>
>>>> It would be nice if user-space programs that want to use the TSC with
>>>> rdtsc / rdtscp instructions, such as the demo program attached to the
>>>> bug report,
>>>> could have confidence that Linux is actually generating the results of
>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>>>> in a predictable way from the TSC by looking at the
>>>> /dev/cpu/0/cpuid[bit((3*32)+10)] value before enabling user-space
>>>> use of TSC values, so that they can correlate TSC values with linux
>>>> clock_gettime() values.
>>>
>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>>
>>> Nothing at all, really.
>>>
>>> The kernel makes use of the proper information values already.
>>>
>>> The TSC frequency is determined from:
>>>
>>> 1) CPUID(0x16) if available
>>> 2) MSRs if available
>>> 3) By calibration against a known clock
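
As an illustration of option 3 only, here is a rough sketch of the arithmetic behind calibrating against a known clock: count TSC ticks across an interval whose length in nanoseconds is known from another clock, then scale to kHz. The function name is invented, the kernel's real calibration code is considerably more careful, and the 128-bit type is a GCC/Clang extension on 64-bit targets.

```c
#include <stdint.h>

/*
 * kHz = ticks per millisecond = tsc_delta * 1,000,000 / ns_delta.
 * Returns 0 if the reference interval is empty.
 */
static uint64_t tsc_khz_from_interval(uint64_t tsc_delta, uint64_t ns_delta)
{
    if (ns_delta == 0)
        return 0;
    return (uint64_t)(((unsigned __int128)tsc_delta * 1000000u) / ns_delta);
}
```

For example, 28,932,990 ticks observed over a 10 ms reference interval give 2,893,299 kHz, which corresponds to the 2.893299GHz printed by the test program earlier in this thread.
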
>>>
>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values
>>> are correct whether that machine has ART exposed to peripherals or not.
>>>
>>>> has tsc: 1 constant: 1
>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>>
>>> And that voodoo math tells us what? That you found a way to correlate
>>> CPUID(0xd) to the TSC frequency on that machine.
>>>
>>> Now I'm curious how you do that on this other machine which returns for
>>> cpuid(15): 1, 1, 1
>>>
>>> You can't because all of this is completely wrong.
>>>
>>> Thanks,
>>>
>>> tglx
>>>
>>
>