Re: [PATCH v1 1/2] x86/tsc: use logical_package as a better estimation of socket numbers
From: Feng Tang
Date: Tue Oct 25 2022 - 03:57:33 EST
On Mon, Oct 24, 2022 at 08:43:33AM -0700, Dave Hansen wrote:
> On 10/24/22 00:37, Feng Tang wrote:
> >> For instance, I can live with the implementation being a bit goofy when
> >> kernel commandlines are in play. We can pr_info() about those cases.
> > Something like adding
> >
> > pr_info("Watchdog for TSC is disabled for this platform while estimating
> > the socket number is %d, if the real socket number is bigger than
> > 4 (may due to some tricks like 'maxcpus=' cmdline parameter, please
> > add 'tsc=watchdog' to cmdline as well\n", logical_packages);
>
> That's too wishy-washy. Also, I *KNOW* Intel has built systems with
> wonky, opaque numbers of "sockets". Cascade Lake was a single physical
> "socket", but in all other respects (including enumeration to software)
> it acted like two logical sockets.
>
> So, what was the "real" socket number for Cascade Lake? If you looked
> in a chassis, you'd see one socket. But, there were two dies in that
> socket talking to each other over UPI, so it had a system topology which
> was indistinguishable from a 2-socket system.
Good to know and thanks for the info.
I have to admit I haven't checked a server's internals before, and
thanks to Oliver for helping checking some Cascade Lake boxes of 0Day.
In one box where 'lscpu' shows 4 sockets (96 cores in total), it does
only have 2 physical processors in the chassis, just like you
mentioned, it has 2 dies for each processor. And in another box,
'lscpu' shows 2 sockets (44 cores in total), it also has 2 physical
processors but with much smaller size.
And fortunately the 'logical_packages' for these 2 boxes are both
the correct value: 2.
> Let's just state the facts:
>
> pr_info("Disabling TSC watchdog on %d-package system.", ...)
>
> Then, we can have a flag elsewhere to say how reliable that number is.
> A taint flag or CPU bug is probably going to far, but something like this:
>
> bool logical_package_count_unreliable = false;
>
> void mark_bad_package_count(char *reason)
> {
> if (logical_package_count_unreliable)
> return true;
>
> pr_warn("processor package count is unreliable");
> }
>
> Might be OK. Then you can call mark_bad_package_count() from multiple
> sites, like the maxcpus= code.
This should work! we can just add one more check:
boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
!logical_package_count_unreliable &&
logical_packages <= 2
And when some new case leading to a imprecise 'logical_packages' is
found in future, we could just apply to this to it.
> But, like I said in the other thread, let's make sure we're agreed on
> the precise problem that we're solving before we go down this road.
Sure.
Thanks,
Feng
> > and adding a new 'tsc=watchdog' option to force watchdog on (might be
> > over-complexed?)
>
> Agreed, I don't think that's quite warranted yet.