Re: [PATCH 1/3] x86/kernel: Add option that TSC on Socket 0 being non-null is valid

From: Thomas Gleixner
Date: Tue Sep 26 2017 - 03:29:28 EST


On Mon, 25 Sep 2017, Mike Travis wrote:
> On 9/25/2017 11:10 AM, Thomas Gleixner wrote:
> > The reason why I put it there in the first place was to make the TSC
> > deadline timer work on a whole range of systems. It turned out that our
> > 'fix' was not enough so we changed that later to disable the deadline timer
> > completely on affected systems when the firmware does not contain a fix for
> > it. So there is no real technical reason anymore to enforce TSC_ADJUST == 0
> > on the boot CPU. So rather than special casing this for UV we should just
> > remove that requirement and leave the boot value as is.
>
> Okay, I could do that but as I mentioned I'm not comfortable changing code
> that I cannot personally verify is correct. I'm assuming you only mean
> removing the part where socket 0 should have a TSC_ADJUST value of 0? And the
> part mentioned below that all cpu threads on a socket should have the same
> adjust values should not be changed?

Right, only the restriction of TSC_ADJUST == 0 on the boot CPU can be
removed. If the TSC_ADJUST values in a package differ, then BIOS did
something really stupid because the underlying counter is the same for all
threads in a package and different TSC_ADJUST values will result in TSC out
of sync being detected.

> > > Our BIOS team did make a change to conform to the "TSC_ADJUST should be
> > > the
> > > same on all cpu threads on a single socket" requirement, so we were able
> > > to
> > > pass that part of the TSC validation functions. (Prior to this, the TSC's
> > > were synced by writing directly to the TSC MSR and natural delays in the
> > > processor firmware caused the slight differences in the TSC ADJUST
> > > values.)
> >
> > Right. TSC_ADJUST is there for a reason.
>
> Well, there is the problem with bit 63 (sign bit) in the TSC ADJUST MSR. So
> it's almost there... :)

What's the problem with bit 63?

> > But back to my question about ART. You might talk to your BIOS/HW folks
> > about that and eventually disable the ART related functionality in the
> > kernel on UV.
>
> Once again, I'm not sure what code you are talking about. I talked to the
> BIOS engineer specifically working on TSC related items and he was also not
> familiar with the Intel ART spec. I could go farther up the chain to the
> hardware designers but right now this would be extremely low on their priority
> queue.
>
> Perhaps if you were more specific in what code should be disabled? We would
> like to keep as much as the self check code as possible as there are still
> some remotely possible failure cases. And removing the self checks would have
> a negative impact on reliability.

detect_art() -> X86_FEATURE_ART

X86_FEATURE_ART enables the hardware assisted PTP <-> TSC correlation in
network cards and is used for network synchronized audio stuff. It's
probably a non issue today but TSN is becoming more widespread so I expect
more users in the foreseeable future. We can just drop out in detect_art()
if the system is UV so X86_FEATURE_ART does not get set and all related
functionality is disabled automagically.

> I actually did test using the CPU flags, "TSC reliable" and no "TSC ADJUST" to
> prevent the kernel from overwriting the already correct TSC ADJUST values.
> But that also turned off the clock source watchdog and having the
> check_tsc_sync_target() test run was a good verification step. Our goal
> was/is to conform as much as possible to the checks and the only unfixable
> code was the assumption that socket 0 should have a TSC_ADJUST value of 0.

Which can be removed. If you don't want to do it. I'm happy to write the
patch^Wchangelog myself.

Thanks,

tglx