Re: [PATCH v1 1/2] x86/tsc: use logical_package as a better estimation of socket numbers

From: Feng Tang
Date: Tue Oct 25 2022 - 03:36:08 EST


On Mon, Oct 24, 2022 at 08:42:30AM -0700, Dave Hansen wrote:
> On 10/22/22 09:12, Zhang Rui wrote:
> >>> I'm not sure if we have a perfect solution here.
> >> Are the implementations fixable?
> > currently, I don't have any idea.
> >
> >> Or, at least tolerable?
>
> That would be great to figure out before we start throwing more patches
> around.

Yes, agreed!

> >> For instance, I can live with the implementation being a bit goofy
> >> when
> >> kernel commandlines are in play. We can pr_info() about those cases.
> > My understanding is that the cpus in the last package may still have
> > small cpu id value. This means that the 'logical_packages' is hard to
> > break unless we boot with very small CPU count and happened to disable
> > all cpus in one/more packages. Feng is experiencing with this and may
> > have some update later.
> >
> > If this is the case, is this a valid case that we need to take care of?
>
> Well, let's talk through it a bit.
>
> What is the triggering event and what's the fallout?

In the worst case (2 sockets), if maxcpus is '<= total_cpus/2',
'logical_packages' will be smaller than the real number of packages.

> Is the user on a truly TSC stable system or not?
>
> What kind of maxcpus= argument do they need to specify? Is it something
> that's likely to get used in production or is it most likely just for
> debugging?

IIUC, on servers it's most likely used for debugging. And for
clients, the socket number is not an issue.

> What is the maxcpus= fallout? Does it over estimate or under estimate
> the number of logical packages?

Only under estimate.

> How many cases outside of maxcpus= do we know of that lead to an
> imprecise "logical packages" calculation?

Thanks to the info from you, Peter and Rui, we have listed a bunch of
use cases other than 'maxcpus', and they won't lead to an imprecise
'logical_packages'. But I'm not sure whether there are other cases
which haven't popped up yet.

> Does this lead to the TSC being mistakenly marked stable when it is not,
> or *not* being marked stable when it is?

Only the former case, 'mistakenly marked stable', is possible, e.g.
when we use 'maxcpus=8' on a 192-core, 8-socket machine.
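
To make the underestimate concrete, here is a rough toy model (not
kernel code). It assumes CPU ids are enumerated package by package,
which is the worst case for this problem, and that every package has
the same number of CPUs. The function name and its parameters are
made up for illustration only.

```c
/*
 * Toy model, NOT kernel code: estimate how many packages are seen
 * when only the first 'maxcpus' CPUs are brought up.
 *
 * Assumptions (hypothetical, for illustration):
 *  - CPU ids are contiguous within a package (worst-case layout),
 *  - all packages hold the same number of CPUs,
 *  - 'maxcpus' is a plain upper limit on the CPUs brought up.
 */
static unsigned int packages_seen(unsigned int total_cpus,
				  unsigned int nr_packages,
				  unsigned int maxcpus)
{
	unsigned int cpus_per_pkg, online;

	if (!nr_packages || total_cpus < nr_packages)
		return 0;

	cpus_per_pkg = total_cpus / nr_packages;
	online = maxcpus < total_cpus ? maxcpus : total_cpus;

	/* Round up: a partially populated package still counts. */
	return (online + cpus_per_pkg - 1) / cpus_per_pkg;
}
```

Under these assumptions, packages_seen(192, 8, 8) gives 1 instead of
8, and for a 2-socket box with 96 CPUs, packages_seen(96, 2, 48)
gives 1 while packages_seen(96, 2, 49) gives 2. The estimate can only
be too small, never too large, so a check of the form "packages <=
some limit" can pass when it should not.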

> Let's get all of that info in one place and make sure we are all agreed
> on the *problem* before we got to the solution space.

OK.

Thanks,
Feng