RE: [PATCH v3] printk: fix zero-valued printk timestamps in early boot
From: Bird, Tim
Date: Mon Mar 30 2026 - 16:45:18 EST
Thomas,
Your response is both very helpful and a bit frustrating. See below.
> -----Original Message-----
> From: Thomas Gleixner <tglx@xxxxxxxxxx>
> Tim!
>
> On Fri, Mar 27 2026 at 18:48, Tim Bird wrote:
> > Well, this is using get_cycles(), which already exists on most architectures.
>
> The fact that get_cycles() exists does not make it a good choice. There
> is a reason why anything which deals with even remotely reliable time
> requirements stopped using it. It's still there as a low level
> architecture specific interface and most other usage is purely
> historical or wrong to begin with and should be removed completely.
>
> A lot of people spent a significant amount of time to get rid of this
> ill defined mechanism and it's just sad that they did not manage to
> eliminate it completely.
>
That's good to know. Thanks. For what it's worth, I wouldn't describe
my desired usage as being for remotely reliable time requirements.
> > This patch just adds a funky way to use cycles (which are available
> > from power on, rather than from the start of kernel timekeeping) to
> > allow saving timing data for some early printks (usually about 20 to
> > 60 printks).
>
> I can see that, but I'm not accepting yet another ill defined glued on
> mechanism which relies on a historical ill defined mistake.
>
> > Also, my current plan is to back off of adjusting the offset of
> > unrelated (non-pre-time_init()) printks, and limit the effect in the
> > system to just those first early (pre-time_init()) printks. The
> > complication to add an offset to all following printks was just to
> > avoid a discontinuity in printk timestamps, once time_init() was
> > called and "real" timestamps started producing non-zeros. Given how
> > confusing this seems to have made things, I'm thinking of backing off
> > of that approach.
>
> This discontinuity results from the fact that you glued it into the
> printk code and sched_clock() does not know about it.
Yes. Of course. That was intentional.
>
> >> printk()
> >>
> >> time_ns = local_clock();
> > that's ts_nsec = local_clock()
>
> That obviously changes the illustrative nature of my narrative
> significantly. Thanks for pointing it out.
I was confused by the variable in your narrative, since I hadn't seen it.
I grepped and reexamined the code. I clarified what I believed to
be the variable you were referencing. I could have worded this better,
but I don't believe you needed to respond with sarcasm.
>
> >> As this needs to be supported by the architecture/platform in any case
> >> there is close to zero benefit from creating complicated generic
> >> infrastructure for this.
> >
> > The problem with this is that tsc_early_uncalibrated() can't return
> > nanoseconds until after calibration.
>
> In theory it could for most modern x86 CPUs as they advertise the
> nominal TSC frequency in CPUID. Other architectures have well known
> clocksource frequencies, e.g. S390 has a known nominal frequency of 1GHz
> (IIRC).
>
> But that does not solve any of the other problems. See below.
>
> > I don't think it's a good idea to return cycles sometimes and nanoseconds
> > at other times, from a deep-seated timing function like this.
> > Also tsc_available() might itself depend on initialization that hasn't happened yet
> > (in early boot).
>
> Access to the TSC requires the X86_FEATURE_TSC bit being set, which
> happens in early_cpu_init(). Before that get_cycles() returns firmly 0.
>
Yes.
> > My approach of saving cycles in ts_nsec for the early printks works
> > because there's a limited number of places (only 2) inside the printk
> > code where ts_nsec is accessed, meaning that the code to detect a
> > cycles value instead of a nanoseconds value can be constrained to just
> > those two places. Basically, I'm doing the conversion from cycles to
> > nanoseconds at printk presentation time, rather than at the time of
> > printk message submission.
>
> I know, but that again requires to add more ill defined infrastructure.
Quite possibly you are right. Are you saying that the general concept
of saving cycles to be converted later to nanoseconds is the ill-defined
infrastructure, or are you saying using get_cycles() is not safe or
accurate enough?
From what you say below, I believe you are saying the latter.
If you don't like the deferred cycles conversion, that's fine. I like the approach
you demonstrated in your subsequent PoC patch to show super-early
tsc calibration (using a kernel param). I can certainly live with 14 microseconds
of missed optimization information, if that's all it is.
>
> We are not aiming to add more, we want to get rid of it completely to
> the extent possible.
OK.
>
> > The approach that I originally started with
> > (see https://lore.kernel.org/linux-embedded/39b09edb-8998-4ebd-a564-7d594434a981@xxxxxxxx/)
> > was to use hardcoded multiplier and shift values for converting from cycles
> > to nanoseconds. These multiplier and shift values would be set at kernel
> > configuration time (ie, using CONFIG values).
>
> Which makes it unusable for distro kernels and therefore a non-starter.
Can you elaborate on this? Indeed, distro kernels would not be able to pre-set
TSC calibration values, and would not turn on this feature (in that version
of the patch) for production release kernels. But it sounds like you are saying
that anything which requires a non-general configuration (or one used only
temporarily during development) is not acceptable upstream. Is that your position?
This "feature" is intended as a tool for developers who are optimizing Linux
kernel boot time. (I'm not sure who else would be interested in getting
timing data for these (currently zero-timestamped) printks during the
first 100-400 milliseconds of kernel boot.) These would be, I believe, developers
who can change their configs and compile their own kernels. The v1 version of
the patch included calculations and printks to help developers set the
calibration values for their hardware, so it was not targeted only at my machine.
This patch is part of a larger effort on my part to help automate boot-time tuning
of the kernel. Many other parts of that effort rely on reconfiguration and recompilation
of the kernel, which makes the whole thing a development-time effort, not so much
a run-time, end-user, or production-level feature. And very much not a thing that
can be accomplished with distro-only configs.
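Purely to illustrate what "set at kernel configuration time" means above, the v1 approach amounts to Kconfig symbols along these lines (the symbol names here are invented for this sketch, not the actual v1 names):

```kconfig
config PRINTK_EARLY_TIMESTAMPS
	bool "Timestamp pre-time_init() printk messages"
	help
	  Development aid for boot-time tuning; not intended for
	  production or distro kernels.

config PRINTK_EARLY_TSC_MULT
	int "Cycles-to-nanoseconds multiplier for early printk timestamps"
	depends on PRINTK_EARLY_TIMESTAMPS

config PRINTK_EARLY_TSC_SHIFT
	int "Cycles-to-nanoseconds shift for early printk timestamps"
	depends on PRINTK_EARLY_TIMESTAMPS
```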
>
> > There are other approaches, but none really work early enough in the
> > kernel boot to not be a pain. The goal is to provide timing info
> > before: timekeeping init, jiffies startup, and even CPU features
> > determination,
>
> As I pointed out before that's wishful thinking:
>
> You _cannot_ access a resource before it has been determined to be
> available.
>
> Period.
>
> It does not matter at all if _you_ know for sure that it is the case in
> _your_ personal setup.
I used get_cycles(), which has a check for availability in it, so the patch didn't
access a resource before it was determined to be available.
It sounds like you're responding to my wording above and not the patch itself.
It is not my intent to handle only my personal setup.
Your help to make sure that this feature is as general as possible is much appreciated.
>
> > and to keep the effect narrow -- limited only to printks, and the
> > first few pre-time_init() printk messages, at that.
>
> Either it is solved in a generic way or we have to agree that it's not
> solvable at all.
I disagree that this limited problem (zero-valued timestamps
in early boot) has to be solved in a generic way.
I already limited the solution space to only processors that I believed
had reliable, pre-kernel-initialized cycle generators.
I think it's fair to have a specialized solution to a specialized problem, if
it can be made to have very limited effect on other code.
I tried to avoid affecting any other timekeeping mechanism
(ie re-engineering local_clock), specifically to avoid unwanted
side effects.
> Your narrow effect argument is bogus and you know that
> very well.
This sounds like you believe I am arguing in bad faith.
I don't believe that I am.
>
> > I'm now researching a suggestion from Shashank Balaji to use the
> > existing calibration data from tsc initialization, which might
> > simplify the current patch even further. I'll make sure to CC you on
> > the next version of the patch.
>
> If you want to use the calibration data from tsc_early_init() then you
> achieve exactly _nothing_ because tsc_early_init() also enables the
> early sched clock on bare metal. On a VM with KVM clock available the
> KVM clock setup enables the early sched clock even before that via
> init_hypervisor_platform().
In the v3 patch, the timing of sched_clock activation is independent of the
timing of the use of the tsc calibration values. I don't understand
how this comment is relevant to my patch.
>
> The early TSC init happens in setup_arch() via tsc_early_init() and it's
> completely unclear whether you can always access the TSC safely before
> that unconditionally due to SNP, which requires to enable the secure TSC
> first. There is a reason why all of this is ordered the way it is.
OK. Thanks for that info. I'll take a look at that.
What happens if you try an rdtsc() before tsc_early_init()? If it returns zero, I can
live with that. If it faults or returns random data, that's a problem.
>
> While reading TSC way before that might work on bare metal and in most
> VMs, it's not guaranteed to be safe unconditionally unless someone sits
> down and provides proof to the contrary. As always I'm happy to be
> proven wrong.
>
> When tsc_early_init() was introduced for the very same reason you are
> looking into that, quite some people spent a lot of time to come up with
> a solution which was deemed safe enough to be used unconditionally.
>
> Please consult the LKML archive for the full history. The commit links
> will give you a proper starting point.
OK - thanks for the information and the pointers. get_cycles() does
have a CPU features check, so my access to the TSC was not completely
unconditional. I'll do some more research here and try to make the
ultimate solution as safe as possible.
> That said, I completely understand the itch you are trying to scratch
> and I'm the least person to prevent an architecturally sound solution,
> but I'm also the first person to NAK any attempt which is based on
> uninformed claims and 'works for me' arguments.
Well, one of the purposes of posting patches is to get feedback
on different approaches. I'm honestly not trying to create a solution that only
works on my hardware. But neither am I trying to boil the ocean
here.
It's not clear to me that there's much harm in having
a single discontinuity early in the printk timestamps. No one has told
me yet that printk timestamps absolutely MUST be monotonically
increasing, everywhere.
Trying to solve that issue led to a more convoluted solution than
my first approach (deferred cycles conversion versus compile-time
calibration data).
I also don't think there's much harm if the data for these few printks
is unavailable or wrong on some hardware. Maybe that's an area where
we differ in opinion.
>
> The only clean way to solve this cleanly is moving the sched clock
> initialization to the earliest point possible and accepting that due to
> hardware, enumeration and virtualization constraints this point might be
> suboptimal. Everything else is just an attempt to defy reality.
I think we're willing to accept different areas of sub-optimality.
I can live with having to configure the feature, and statically configure
the TSC calibration, at the cost of non-monotonic printk
timestamps for the early printks.
Personally, I think that altering sched_clock or any other
kernel timekeeping is overkill for this.
I do like the solution proposed in your patch. I considered a
solution using an early-parsed kernel command line arg, but didn't
know how early I could do that. I'll review your patch. Thanks!
And I appreciate the time you took to educate me about some of the
other issues involved (especially secure TSC, possible TSC unavailability,
and VM timekeeping issues, all of which complicate this).
Regards,
-- Tim