Re: [PATCH] arch/x86/kernel/tsc.c : set X86_FEATURE_ART for TSC on CPUs like i7-4910MQ : bug #194609

From: Jason Vas Dias
Date: Thu Feb 23 2017 - 13:05:25 EST


I have found a new source of weirdness with TSC using
clock_gettime(CLOCK_MONOTONIC_RAW,&ts) :

The vsyscall_gtod_data.mult field changes slightly between
calls to clock_gettime(CLOCK_MONOTONIC_RAW,&ts),
so that an extra tsc_cycles / 2^24 nanoseconds are sometimes added
to or removed from the value derived from the TSC and stored in 'ts'.

This is demonstrated by the output of the test program in the
attached ttsc.tar file:
$ ./tlgtd
it worked! - GTOD: clock:1 mult:5798662 shift:24
synced - mult now: 5798661

What the program does is find the address of the 'vsyscall_gtod_data'
structure in /proc/kallsyms, map that virtual address to an ELF segment
offset within /proc/kcore, and read just the 'vsyscall_gtod_data'
structure into user-space memory.
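
For reference, a minimal sketch of that lookup (my assumptions: run as
root with kptr_restrict=0, an x86-64 ELF64 /proc/kcore, and a kernel
that still lists the symbol - decoding the structure layout is left out):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <elf.h>

static unsigned long long kallsyms_lookup(const char *name)
{   /* scan /proc/kallsyms for "<addr> <type> <name>" lines */
    FILE *f = fopen("/proc/kallsyms", "r");
    char line[256], sym[128];
    unsigned long long a, addr = 0;
    char type;
    if (!f) { perror("/proc/kallsyms"); exit(1); }
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "%llx %c %127s", &a, &type, sym) == 3
            && !strcmp(sym, name)) { addr = a; break; }
    fclose(f);
    return addr;
}

int main(void)
{
    unsigned long long vaddr = kallsyms_lookup("vsyscall_gtod_data");
    int fd = open("/proc/kcore", O_RDONLY);
    Elf64_Ehdr eh;
    Elf64_Phdr ph;
    unsigned char buf[64];  /* just the first bytes of the structure */
    unsigned i;
    if (!vaddr || fd < 0) { perror("setup"); return 1; }
    if (pread(fd, &eh, sizeof eh, 0) != sizeof eh) return 1;
    /* walk the PT_LOAD program headers to turn the kernel virtual
     * address into a file offset within /proc/kcore */
    for (i = 0; i < eh.e_phnum; i++) {
        if (pread(fd, &ph, sizeof ph,
                  eh.e_phoff + (off_t)i * eh.e_phentsize) != sizeof ph)
            return 1;
        if (ph.p_type == PT_LOAD && vaddr >= ph.p_vaddr
            && vaddr - ph.p_vaddr < ph.p_memsz) {
            off_t off = ph.p_offset + (off_t)(vaddr - ph.p_vaddr);
            if (pread(fd, buf, sizeof buf, off) == (ssize_t)sizeof buf)
                printf("read %zu bytes of vsyscall_gtod_data @ %#llx\n",
                       sizeof buf, vaddr);
            break;
        }
    }
    close(fd);
    return 0;
}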

Really, this 'mult' value, which is used to compute the
seconds|nanoseconds value:
( tsc_cycles * mult ) >> shift
(where shift is 24), should not change after the first time it is
initialized.

The TSC is meant to be FIXED FREQUENCY, right ?
So how could / why should the conversion function from TSC ticks to
nanoseconds change ?
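
To put numbers on it: with the mult:5798662 shift:24 pair shown above,
each TSC tick contributes 5798662 / 2^24 ~= 0.3456 ns, i.e. the period
of a ~2.893299 GHz counter. The observed change of mult by 1 (5798662
-> 5798661) therefore changes the derived value by tsc_cycles / 2^24 ns
- roughly 172 ns per second of elapsed cycles at this frequency.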

So now it is doubly difficult for user-space libraries to keep their
RDTSC-derived seconds|nanoseconds values correlated with those returned
by the kernel, because they must regularly re-read the updated 'mult'
value used by the kernel.

I really don't think the kernel should randomly be deciding to
increase / decrease the TSC tick period by 2^-24 nanoseconds!

Is this a bug or intentional ? I am searching for all places where a
'[.>]mult.*=' occurs, but this returns rather a lot of matches.

Please could a future version of Linux at least export the 'mult' and
'shift' values for the current clocksource!

Regards,
Jason


On 22/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
> OK, last post on this issue today -
> can anyone explain why, with a standard 4.10.0 kernel, no new
> 'notsc_adjust' option, and the same maths being used, these two runs
> display such a wide disparity between
> clock_gettime(CLOCK_MONOTONIC_RAW,&ts) values ? :
>
> $ J/pub/ttsc/ttsc1
> max_extended_leaf: 80000008
> has tsc: 1 constant: 1
> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1.
> ts2 - ts1: 162 ts3 - ts2: 110 ns1: 0.000000641 ns2: 0.000002850
> ts3 - ts2: 175 ns1: 0.000000659
> ts3 - ts2: 18 ns1: 0.000000643
> ts3 - ts2: 18 ns1: 0.000000618
> ts3 - ts2: 17 ns1: 0.000000620
> ts3 - ts2: 17 ns1: 0.000000616
> ts3 - ts2: 18 ns1: 0.000000641
> ts3 - ts2: 18 ns1: 0.000000709
> ts3 - ts2: 20 ns1: 0.000000763
> ts3 - ts2: 20 ns1: 0.000000735
> ts3 - ts2: 20 ns1: 0.000000761
> t1 - t0: 78200 - ns2: 0.000080824
> $ J/pub/ttsc/ttsc1
> max_extended_leaf: 80000008
> has tsc: 1 constant: 1
> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz - TSC adjust: 1.
> ts2 - ts1: 217 ts3 - ts2: 221 ns1: 0.000001294 ns2: 0.000005375
> ts3 - ts2: 210 ns1: 0.000001418
> ts3 - ts2: 23 ns1: 0.000001399
> ts3 - ts2: 22 ns1: 0.000001445
> ts3 - ts2: 25 ns1: 0.000001321
> ts3 - ts2: 20 ns1: 0.000001428
> ts3 - ts2: 25 ns1: 0.000001367
> ts3 - ts2: 23 ns1: 0.000001425
> ts3 - ts2: 23 ns1: 0.000001357
> ts3 - ts2: 22 ns1: 0.000001487
> ts3 - ts2: 25 ns1: 0.000001377
> t1 - t0: 145753 - ns2: 0.000150781
>
> (complete source of test program ttsc1 attached in ttsc.tar
> $ tar -xpf ttsc.tar
> $ cd ttsc
> $ make
> ).
>
> On 22/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
>> I actually tried adding a 'notsc_adjust' kernel option to disable any
>> setting of or access to the TSC_ADJUST MSR, but then I saw the problems -
>> a big disparity in values depending on which CPU the thread is scheduled
>> on - and no improvement in clock_gettime() latency. So I don't think the
>> new TSC_ADJUST code in tsc_sync.c itself is the issue - but something
>> added @ 460ns onto every clock_gettime() call when moving from v4.8.0 ->
>> v4.10.0. As I don't think fixing the clock_gettime() latency issue is my
>> problem, or even possible with the current clock architecture, it is a
>> non-issue.
>>
>> But please, can anyone tell me if there are any plans to move the time
>> infrastructure out of the kernel and into glibc along the lines outlined
>> in previous mail - if not, I am going to concentrate on this more radical
>> overhaul approach for my own systems.
>>
>> At least, I think mapping the clocksource information structure itself
>> into some kind of sharable page makes sense. Processes could map that
>> page copy-on-write so they start off with all the timing parameters
>> preloaded, then keep their copy updated using the rdtscp instruction, or
>> msync() (read-only) with the kernel's single copy to get the latest time
>> any process has requested. All real-time parameters & adjustments could
>> be stored in that page, & eventually a single copy of the tzdata could
>> be used by both kernel & user-space - see the illustrative layout below.
>> That is what I am working towards. Any plans to make the Linux real-time
>> TSC clock user-friendly ?
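>>
>> Purely to illustrate the kind of page layout I mean (everything here is
>> hypothetical - no such interface exists in the kernel today):
>>
>> #include <stdint.h>
>> struct gtod_page {            /* one page, mapped read-only by users */
>>     uint32_t seq;             /* seqcount so readers detect an update */
>>     uint32_t mult;            /* current clocksource mult             */
>>     uint32_t shift;           /* current clocksource shift            */
>>     uint64_t base_cycles;     /* TSC value at the last kernel update  */
>>     uint64_t base_ns;         /* nanoseconds at base_cycles           */
>> };
>> /* reader: ns = base_ns + (((rdtsc() - base_cycles) * mult) >> shift) */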
>>
>>
>>
>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
>>> Yes, my CPU is still getting a fault every time the TSC_ADJUST MSR is
>>> read or written. It is probably because it genuinely does not support
>>> any CPUID leaf > 13, or the modern TSC_ADJUST interface. This is
>>> probably why my clock_gettime() latencies are so bad. Now I have to
>>> develop a patch to disable all access to the TSC_ADJUST MSR if
>>> boot_cpu_data.cpuid_level <= 13 .
>>> I really have an unlucky CPU :-) .
>>>
>>> But really, I think this issue goes deeper, into the fundamental limits
>>> of time measurement on Linux : it is never going to be possible to
>>> measure minimum times with clock_gettime() comparable with those
>>> returned by the rdtscp instruction - the time taken to enter the kernel
>>> through the VDSO, access vsyscall_gtod_data, do the computations & copy
>>> the value to user-space is NEVER going to be up to the job of measuring
>>> small real-time durations of the order of 10-20 TSC ticks .
>>>
>>> I think the best way to solve this problem going forward would be to
>>> store the entire vsyscall_gtod_data structure representing the current
>>> clocksource in a shared page which is memory-mappable (read-only) by
>>> user-space. I think user-space programs should be able to do something
>>> like :
>>> int fd =
>>> open("/sys/devices/system/clocksource/clocksource0/gtod.page",O_RDONLY);
>>> size_t psz = getpagesize();
>>> void *gtod = (fd >= 0) ? mmap( 0, psz, PROT_READ, MAP_PRIVATE, fd, 0 )
>>>                        : MAP_FAILED;
>>> if( gtod != MAP_FAILED )
>>> msync(gtod,psz,MS_SYNC);
>>>
>>> Then they could all read the real-time clock values as they are updated
>>> in real-time by the kernel, and know exactly how to interpret them .
>>>
>>> I also think that all mktime() / gmtime() / localtime() timezone
>>> handling functionality should be moved to user-space, and that the
>>> kernel should actually load and link in some /lib/libtzdata.so library,
>>> provided by glibc / libc implementations - exactly the same library
>>> used by glibc code to parse tzdata. tzdata should be loaded at boot
>>> time by the kernel from the same places glibc loads it, and both the
>>> kernel and glibc should use identical mktime(), gmtime(), etc.
>>> functions to access it, so glibc-using code would not need to enter the
>>> kernel at all for any time-handling code. This tzdata library would be
>>> automatically loaded into process images the same way the vdso region
>>> is, and the whole system could access only one copy of it and the
>>> 'gtod.page' in memory.
>>>
>>> That's just my two cents' worth, and how I'd like eventually to get
>>> things working on my system.
>>>
>>> All the best, Regards,
>>> Jason
>>>
>>>
>>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
>>>> On 22/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
>>>>> RE:
>>>>>>> 4.10 has new code which utilizes the TSC_ADJUST MSR.
>>>>>
>>>>> I just built an unpatched linux v4.10 with tglx's TSC improvements -
>>>>> much else improved in this kernel (like iwlwifi) - thanks!
>>>>>
>>>>> I have attached an updated version of the test program which
>>>>> doesn't print the bogus "Nominal TSC Frequency" (the previous
>>>>> version printed it, but equally ignored it).
>>>>>
>>>>> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
>>>>> a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! :
>>>>>
>>>>> $ uname -r
>>>>> 4.10.0
>>>>> $ ./ttsc1
>>>>> max_extended_leaf: 80000008
>>>>> has tsc: 1 constant: 1
>>>>> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
>>>>> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
>>>>> ts3 - ts2: 178 ns1: 0.000000592
>>>>> ts3 - ts2: 14 ns1: 0.000000577
>>>>> ts3 - ts2: 14 ns1: 0.000000651
>>>>> ts3 - ts2: 17 ns1: 0.000000625
>>>>> ts3 - ts2: 17 ns1: 0.000000677
>>>>> ts3 - ts2: 17 ns1: 0.000000626
>>>>> ts3 - ts2: 17 ns1: 0.000000627
>>>>> ts3 - ts2: 17 ns1: 0.000000627
>>>>> ts3 - ts2: 18 ns1: 0.000000655
>>>>> ts3 - ts2: 17 ns1: 0.000000631
>>>>> t1 - t0: 89067 - ns2: 0.000091411
>>>>>
>>>>
>>>>
>>>> Oops, going blind in my old age. These latencies are actually about
>>>> four times greater than under 4.8 !!
>>>>
>>>> Under 4.8, the program printed latencies of @ 140ns for clock_gettime,
>>>> as shown in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:
>>>>
>>>> ts3 - ts2: 24 ns1: 0.000000162
>>>> ts3 - ts2: 17 ns1: 0.000000143
>>>> ts3 - ts2: 17 ns1: 0.000000146
>>>> ts3 - ts2: 17 ns1: 0.000000149
>>>> ts3 - ts2: 17 ns1: 0.000000141
>>>> ts3 - ts2: 16 ns1: 0.000000142
>>>>
>>>> now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is @ 600ns,
>>>> @ 4 times more than under 4.8 .
>>>> But I'm glad the TSC_ADJUST problems are fixed.
>>>>
>>>> Will programs reading :
>>>> $ cat /sys/devices/msr/events/tsc
>>>> event=0x00
>>>> see a new event for each write to the TSC_ADJUST MSR, or for a wrmsr
>>>> on the TSC ?
>>>>
>>>>> I think this is because under Linux 4.8, the CPU got a fault every
>>>>> time it read the TSC_ADJUST MSR.
>>>>
>>>> maybe it still does!
>>>>
>>>>
>>>>> But user programs wanting to use the TSC and correlate its values
>>>>> accurately with clock_gettime(CLOCK_MONOTONIC_RAW) values, like the
>>>>> above program, still have to dig the TSC frequency value out of the
>>>>> kernel with objdump - this was really the point of bug #194609.
>>>>>
>>>>> I would still like to investigate exporting 'tsc_khz' & 'mult' +
>>>>> 'shift' values via sysfs.
>>>>>
>>>>> Regards,
>>>>> Jason.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 21/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
>>>>>> Thank you for enlightening me -
>>>>>>
>>>>>> I was just having a hard time believing that Intel would ship a chip
>>>>>> that features a monotonic, fixed-frequency timestamp counter without
>>>>>> specifying - in documentation, on-chip, or in ACPI - precisely what
>>>>>> that hard-wired frequency is, but I now know that to be the case for
>>>>>> the unfortunate i7-4910MQ. I mean, the CPU asserts
>>>>>> CPUID:80000007H:EDX[8] ( InvariantTSC ), which is difficult to
>>>>>> reconcile with this statement in the SDM :
>>>>>> 17.16.4 Invariant Time-Keeping
>>>>>>    The invariant TSC is based on the invariant timekeeping hardware
>>>>>>    (called Always Running Timer or ART), that runs at the core
>>>>>>    crystal clock frequency. The ratio defined by CPUID leaf 15H
>>>>>>    expresses the frequency relationship between the ART hardware and
>>>>>>    TSC. If CPUID.15H:EBX[31:0] != 0 and
>>>>>>    CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity
>>>>>>    relationship holds between TSC and the ART hardware:
>>>>>>       TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K
>>>>>>    Where 'K' is an offset that can be adjusted by a privileged agent*2.
>>>>>>    When ART hardware is reset, both invariant TSC and K are also reset.
>>>>>>
>>>>>> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and
>>>>>> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) that
>>>>>> the "Nominal TSC Frequency" formulae in the manual must apply to all
>>>>>> CPUs with InvariantTSC .
>>>>>>
>>>>>> Do I understand correctly that, since I do have InvariantTSC, the
>>>>>> TSC_Value is in fact calculated according to the above formula, but
>>>>>> with a "hidden" ART value, core crystal clock frequency, and ratio of
>>>>>> that clock to the TSC frequency ?
>>>>>> It was obvious this nominal TSC frequency had nothing to do with the
>>>>>> actual TSC frequency used by Linux, which is 'tsc_khz' .
>>>>>> I guess wishful thinking led me to believe CPUID:15H was actually
>>>>>> supported somehow, because I thought InvariantTSC meant the chip had
>>>>>> ART hardware .
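>>>>>>
>>>>>> A quick user-space check (a sketch using gcc's <cpuid.h>;
>>>>>> __get_cpuid() refuses leaves above the maximum basic leaf, which is
>>>>>> exactly the check my earlier test skipped):
>>>>>>
>>>>>> #include <stdio.h>
>>>>>> #include <cpuid.h>
>>>>>> int main(void)
>>>>>> { unsigned eax, ebx, ecx, edx;
>>>>>>   printf("max basic leaf: %u\n", __get_cpuid_max(0, 0));
>>>>>>   /* returns 0 when 0x15 is above the max basic leaf (13 here) */
>>>>>>   if( __get_cpuid(0x15, &eax, &ebx, &ecx, &edx) && ebx )
>>>>>>     printf("TSC/ART ratio: %u/%u crystal Hz: %u\n", ebx, eax, ecx);
>>>>>>   else
>>>>>>     printf("CPUID.15H not architecturally supported\n");
>>>>>>   return 0;
>>>>>> }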
>>>>>>
>>>>>> I do strongly suggest that Linux export its calibrated TSC kHz
>>>>>> somewhere to user-space .
>>>>>>
>>>>>> I think the best long-term solution would be to allow programs to
>>>>>> somehow read the TSC without invoking
>>>>>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts) & having to enter the kernel,
>>>>>> which incurs an overhead of > 120ns on my system .
>>>>>>
>>>>>>
>>>>>> Couldn't Linux export its 'tsc_khz' and / or 'clocksource->mult' and
>>>>>> 'clocksource->shift' values via sysfs somehow ?
>>>>>>
>>>>>> For instance, only if the 'current_clocksource' is 'tsc', these
>>>>>> values could be exported as :
>>>>>> /sys/devices/system/clocksource/clocksource0/shift
>>>>>> /sys/devices/system/clocksource/clocksource0/mult
>>>>>> /sys/devices/system/clocksource/clocksource0/freq
>>>>>>
>>>>>> So user-space programs could know that, with
>>>>>> ns = ( rdtsc() * mult ) >> shift ,
>>>>>> the value returned by clock_gettime(CLOCK_MONOTONIC_RAW) would be
>>>>>> { .tv_sec = ns / 1000000000
>>>>>> , .tv_nsec = ns % 1000000000
>>>>>> }
>>>>>> and that it represents ticks of period (1.0 / ( freq * 1000 )) S.
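>>>>>>
>>>>>> A sketch of the intended consumer, assuming the proposed files above
>>>>>> existed (note: mult/shift should be applied to cycle *deltas*, as the
>>>>>> kernel does, to avoid overflowing 64 bits):
>>>>>>
>>>>>> #include <stdio.h>
>>>>>> #include <stdint.h>
>>>>>> static uint64_t rdtsc(void)
>>>>>> { uint32_t lo, hi;
>>>>>>   asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
>>>>>>   return (((uint64_t)hi) << 32) | lo;
>>>>>> }
>>>>>> static unsigned long sysfs_ul(const char *path)
>>>>>> { unsigned long v = 0;
>>>>>>   FILE *f = fopen(path, "r");
>>>>>>   if( f ) { fscanf(f, "%lu", &v); fclose(f); }
>>>>>>   return v;
>>>>>> }
>>>>>> int main(void)
>>>>>> { unsigned long mult =
>>>>>>     sysfs_ul("/sys/devices/system/clocksource/clocksource0/mult");
>>>>>>   unsigned long shift =
>>>>>>     sysfs_ul("/sys/devices/system/clocksource/clocksource0/shift");
>>>>>>   uint64_t c0 = rdtsc();
>>>>>>   volatile int i; for( i = 0; i < 1000000; i++ ) ; /* work to time */
>>>>>>   uint64_t ns = ((rdtsc() - c0) * mult) >> shift;
>>>>>>   printf("%lu.%09lu s\n", (unsigned long)(ns / 1000000000UL),
>>>>>>          (unsigned long)(ns % 1000000000UL));
>>>>>>   return 0;
>>>>>> }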
>>>>>>
>>>>>> That would save user-space programs from having to learn 'tsc_khz'
>>>>>> by parsing the 'Refined TSC' frequency from log files, or by
>>>>>> examining the running kernel with objdump to obtain this value &
>>>>>> figure out 'mult' & 'shift' themselves.
>>>>>>
>>>>>> And why not a
>>>>>> /sys/devices/system/clocksource/clocksource0/value
>>>>>> file that actually prints this ( ( rdtsc() * mult ) >> shift )
>>>>>> expression as a long integer ?
>>>>>> And perhaps a
>>>>>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
>>>>>> file that prints the number of real-time nanoseconds elapsed since
>>>>>> the contents of the existing
>>>>>> /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
>>>>>> files, using the current TSC value ?
>>>>>> Reading the rtc0/{date,time} files is already faster for scripts than
>>>>>> entering the kernel to call clock_gettime(CLOCK_REALTIME, &ts) &
>>>>>> converting to an integer.
>>>>>>
>>>>>> I will work on developing a patch to this effect if no-one else is.
>>>>>>
>>>>>> Also, am I right in assuming that the maximum granularity of the
>>>>>> real-time clock on my system is 1/64th of a second ? :
>>>>>> $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
>>>>>> 64
>>>>>> Is this the maximum granularity that can be stored in CMOS, as
>>>>>> opposed to that returned by the TSC ? Couldn't we have something
>>>>>> similar that gave an accurate idea of the TSC frequency and the
>>>>>> precise formula applied to the TSC value to get the
>>>>>> clock_gettime(CLOCK_MONOTONIC_RAW) value ?
>>>>>>
>>>>>> Regards,
>>>>>> Jason
>>>>>>
>>>>>>
>>>>>> This code does produce good timestamps, with a latency of @20ns,
>>>>>> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
>>>>>> values, but it depends on a global variable initialized to the
>>>>>> 'tsc_khz' value computed by the running kernel, parsed from objdump
>>>>>> output of /proc/kcore :
>>>>>>
>>>>>> static inline __attribute__((always_inline))
>>>>>> U64_t
>>>>>> IA64_tsc_now()
>>>>>> { if(!( _ia64_invariant_tsc_enabled
>>>>>>       ||(( _cpu0id_fd == -1) &&
>>>>>>           IA64_invariant_tsc_is_enabled(NULL,NULL))
>>>>>>       )
>>>>>>     )
>>>>>>   { /* NB: the format needs __LINE__ and __func__ arguments */
>>>>>>     fprintf(stderr, __FILE__":%d:(%s): must be called with invariant"
>>>>>>             " TSC enabled.\n", __LINE__, __func__);
>>>>>>     return 0;
>>>>>>   }
>>>>>>   U32_t tsc_hi, tsc_lo;
>>>>>>   register UL_t tsc;
>>>>>>   asm volatile
>>>>>>   ( "rdtscp\n\t"
>>>>>>     "mov %%edx, %0\n\t"
>>>>>>     "mov %%eax, %1\n\t"
>>>>>>     "mov %%ecx, %2\n\t"
>>>>>>     : "=m" (tsc_hi), "=m" (tsc_lo), "=m" (_ia64_tsc_user_cpu)
>>>>>>     :                 /* no inputs */
>>>>>>     : "%eax","%ecx","%edx"
>>>>>>   );
>>>>>>   tsc = (((UL_t)tsc_hi) << 32) | ((UL_t)tsc_lo);
>>>>>>   return tsc;
>>>>>> }
>>>>>>
>>>>>> __thread
>>>>>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>>>>>>
>>>>>> static inline __attribute__((always_inline))
>>>>>> U64_t IA64_tsc_ticks_since_start()
>>>>>> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>>>>>> { _ia64_first_tsc = IA64_tsc_now();
>>>>>> return 0;
>>>>>> }
>>>>>> return (IA64_tsc_now() - _ia64_first_tsc) ;
>>>>>> }
>>>>>>
>>>>>> static inline __attribute__((always_inline))
>>>>>> void
>>>>>> ia64_tsc_calc_mult_shift
>>>>>> ( register U32_t *mult,
>>>>>>   register U32_t *shift
>>>>>> )
>>>>>> { /* Paraphrases Linux clocksource.c's clocks_calc_mult_shift():
>>>>>>    * calculates the second + nanosecond mult + shift the same way
>>>>>>    * Linux does. We want to be compatible with what Linux returns in
>>>>>>    * struct timespec ts after clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>>>>>>    */
>>>>>>   const U32_t scale=1000U;
>>>>>>   register U32_t from= IA64_tsc_khz();
>>>>>>   register U32_t to = NSEC_PER_SEC / scale;
>>>>>>   register U64_t sec = ( ~0UL / from ) / scale;
>>>>>>   sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>>>>>>   register U64_t maxsec = sec * scale;
>>>>>>   UL_t tmp;
>>>>>>   U32_t sft, sftacc=32;
>>>>>> /*
>>>>>> * Calculate the shift factor which is limiting the conversion
>>>>>> * range:
>>>>>> */
>>>>>> tmp = (maxsec * from) >> 32;
>>>>>> while (tmp)
>>>>>> { tmp >>=1;
>>>>>> sftacc--;
>>>>>> }
>>>>>> /*
>>>>>> * Find the conversion shift/mult pair which has the best
>>>>>> * accuracy and fits the maxsec conversion range:
>>>>>> */
>>>>>> for (sft = 32; sft > 0; sft--)
>>>>>> { tmp = ((UL_t) to) << sft;
>>>>>> tmp += from / 2;
>>>>>> tmp = tmp / from;
>>>>>> if ((tmp >> sftacc) == 0)
>>>>>> break;
>>>>>> }
>>>>>> *mult = tmp;
>>>>>> *shift = sft;
>>>>>> }
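>>>>>>
>>>>>> /* Usage - for tsc_khz in the 2.9GHz range this picks shift == 24 and
>>>>>>  * a mult near the kernel's (small differences track the refined
>>>>>>  * calibration frequency, not the boot-time value):
>>>>>>  *    U32_t mult, shift;
>>>>>>  *    ia64_tsc_calc_mult_shift( &mult, &shift );
>>>>>>  */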
>>>>>>
>>>>>> __thread
>>>>>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift = ~0U;
>>>>>>
>>>>>> static inline __attribute__((always_inline))
>>>>>> U64_t IA64_s_ns_since_start()
>>>>>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>>>>>>     ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift );
>>>>>>   register U64_t cycles = IA64_tsc_ticks_since_start();
>>>>>>   register U64_t ns = (cycles * ((UL_t)_ia64_tsc_mult)) >> _ia64_tsc_shift;
>>>>>>   return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32)
>>>>>>         | ((ns % NSEC_PER_SEC)&0x3fffffffUL) );
>>>>>>   /* Yes, we are purposefully ignoring durations of more than
>>>>>>    * 4.2 billion seconds here! */
>>>>>> }
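>>>>>>
>>>>>> /* To unpack the value packed above:
>>>>>>  *    U64_t v  = IA64_s_ns_since_start();
>>>>>>  *    U64_t s  = v >> 32;            (seconds since start)
>>>>>>  *    U64_t ns = v & 0x3fffffffUL;   (nanoseconds within the second)
>>>>>>  */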
>>>>>>
>>>>>>
>>>>>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
>>>>>> somehow; then user-space libraries could have more confidence in
>>>>>> using 'rdtsc' or 'rdtscp' if Linux's current_clocksource is 'tsc'.
>>>>>>
>>>>>> Regards,
>>>>>> Jason
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 20/02/2017, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>>>>>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>>>>>>
>>>>>>>> CPUID:15H is available in user-space, returning the integers : ( 7,
>>>>>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 ,
>>>>>>>> so
>>>>>>>> in detect_art() in tsc.c,
>>>>>>>
>>>>>>> By some definition of available. You can feed CPUID random leaf
>>>>>>> numbers and it will return something, usually the value of the last
>>>>>>> valid CPUID leaf, which is 13 on your CPU. A similar CPU model has
>>>>>>>
>>>>>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 edx=0x00000000
>>>>>>>
>>>>>>> i.e. 7, 832, 832, 0
>>>>>>>
>>>>>>> Looks familiar, right?
>>>>>>>
>>>>>>> You can verify that with 'cpuid -1 -r' on your machine.
>>>>>>>
>>>>>>>> Linux does not think ART is enabled, and does not set the
>>>>>>>> synthesized CPUID + ((3*32)+10) bit, so a program looking at
>>>>>>>> /dev/cpu/0/cpuid would not see this bit set .
>>>>>>>
>>>>>>> Rightfully so. This is a Haswell Core model.
>>>>>>>
>>>>>>>> if an e1000 NIC card had been installed, PTP would not be
>>>>>>>> available.
>>>>>>>
>>>>>>> PTP is independent of the ART kernel feature. ART just provides
>>>>>>> enhanced PTP features. You are confusing things here.
>>>>>>>
>>>>>>> The ART feature as the kernel sees it is a hardware extension which
>>>>>>> feeds the ART clock to peripherals for timestamping and time
>>>>>>> correlation purposes. The ratio between ART and TSC is described by
>>>>>>> CPUID leaf 0x15, so the kernel can make use of that correlation,
>>>>>>> e.g. for enhanced PTP accuracy.
>>>>>>>
>>>>>>> It's correct that the NONSTOP_TSC feature depends on the
>>>>>>> availability of ART, but that has nothing to do with the feature
>>>>>>> bit, which solely describes the ratio between TSC and the ART
>>>>>>> frequency which is exposed to peripherals. That frequency is not
>>>>>>> necessarily the real ART frequency.
>>>>>>>
>>>>>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems
>>>>>>>> to be nowhere else in Linux, the code will always think
>>>>>>>> X86_FEATURE_ART is 0, because the CPU will always get a fault
>>>>>>>> reading the MSR since it has never been written.
>>>>>>>
>>>>>>> Huch? If an access to the TSC ADJUST MSR faults, then something is
>>>>>>> really wrong. And writing it unconditionally to 0 is not going to
>>>>>>> happen. 4.10 has new code which utilizes the TSC_ADJUST MSR.
>>>>>>>
>>>>>>>> It would be nice if user-space programs that want to use the TSC
>>>>>>>> with the rdtsc / rdtscp instructions, such as the demo program
>>>>>>>> attached to the bug report, could have confidence that Linux is
>>>>>>>> actually generating the results of
>>>>>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>>>>>>>> in a predictable way from the TSC, by looking at the
>>>>>>>> /dev/cpu/0/cpuid[bit((3*32)+10)] value before enabling user-space
>>>>>>>> use of TSC values, so that they can correlate TSC values with
>>>>>>>> Linux clock_gettime() values.
>>>>>>>
>>>>>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>>>>>>
>>>>>>> Nothing at all, really.
>>>>>>>
>>>>>>> The kernel makes use of the proper information values already.
>>>>>>>
>>>>>>> The TSC frequency is determined from:
>>>>>>>
>>>>>>> 1) CPUID(0x16) if available
>>>>>>> 2) MSRs if available
>>>>>>> 3) By calibration against a known clock
>>>>>>>
>>>>>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_*
>>>>>>> values are correct whether that machine has ART exposed to
>>>>>>> peripherals or not.
>>>>>>>
>>>>>>>> has tsc: 1 constant: 1
>>>>>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>>>>>>
>>>>>>> And that voodoo math tells us what? That you found a way to
>>>>>>> correlate CPUID(0xd) to the TSC frequency on that machine.
>>>>>>>
>>>>>>> Now I'm curious how you do that on this other machine which returns
>>>>>>> for cpuid(15): 1, 1, 1
>>>>>>>
>>>>>>> You can't because all of this is completely wrong.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> tglx
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Attachment: ttsc.tar
Description: Unix tar archive