Re: [RFC] tick_sched_timer() is not properly aligned, fixed by chance

From: Peter Zijlstra
Date: Thu Apr 06 2023 - 06:55:43 EST


On Thu, Apr 06, 2023 at 11:57:35AM +0200, Sebastian Andrzej Siewior wrote:
> With HIGHRES enabled tick_sched_timer() is programmed every jiffy to
> expire the timer_list timers. This timer is programmed accurate in
> respect to CLOCK_MONOTONIC so that 0 is the first tick and the next one
> is 1000/CONFIG_HZ ms later. For HZ=250 every 4 ms and so the us/ns part
> of the timer is always 0. This can be observed by enabling hrtimer_start
> events and looking for the tick_sched_timer timer:
>
> | <idle>-0 [009] 137.085041: hrtimer_start: hrtimer=000000002df4f2bc function=tick_sched_timer expires=162784000000 softexpires=162784000000 mode=0xa
> | <idle>-0 [026] 137.085300: hrtimer_start: hrtimer=00000000b0911b5d function=tick_sched_timer expires=316384000000 softexpires=316384000000 mode=0xa
>
> The lower part (us, ns) is zero. This is important for certain work
> loads where it is needed to either align with the timer or hide after
> it.
>
> With commit
> 857baa87b6422 ("sched/clock: Enable sched clock early")
>
> merged into v4.19-rc1 this 0 offset is gone. The problematic part is the
> last hunk of the patch:
>
> |@@ -356,7 +374,7 @@ u64 sched_clock_cpu(int cpu)
> | return sched_clock() + __sched_clock_offset;
> |
> | if (unlikely(!sched_clock_running))
> |- return 0ull;
> |+ return sched_clock();
> |
> | preempt_disable_notrace();
> | scd = cpu_sdc(cpu);
>
> which returns the sched_clock() without any offset. The gain of this is
> the high resolution output of the bootlog very early, so we see something
> like this:
> |[ 0.000000] tsc: Detected 1995.083 MHz processor
> |[ 0.000893] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> |[ 0.008179] e820: remove [mem 0x000a0000-0x000fffff] usable
> |[ 0.014395] last_pfn = 0xa40000 max_arch_pfn = 0x400000000
> |[ 0.020667] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
> |[ 0.029822] last_pfn = 0xbe000 max_arch_pfn = 0x400000000
> |[ 0.050770] found SMP MP-table at [mem 0x000fd2a0-0x000fd2af]
> |[ 0.057015] Kernel/User page tables isolation: disabled on command line.
> |[ 0.064452] Using GB pages for direct mapping
> |[ 0.070337] RAMDISK: [mem 0x32d23000-0x35688fff]
> |[ 0.075298] ACPI: Early table checksum verification disabled
>
> |[ 1.662577] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> |[ 1.687317] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398412f243e, max_idle_ns: 881590620736 ns
> |[ 1.699071] Calibrating delay loop (skipped), value calculated using timer frequency.. 3990.16 BogoMIPS (lpj=7980332)
> |[ 1.703070] pid_max: default: 32768 minimum: 301
> |[ 1.707238] LSM: Security Framework initializing
>
> but the tick_sched_timer() is no longer properly aligned:
> | <idle>-0 [002] 131.286274: hrtimer_start: hrtimer=000000000b8c5637 function=tick_sched_timer expires=130680348871 softexpires=130680348871 mode=0xa
> | <idle>-0 [006] 131.286275: hrtimer_start: hrtimer=00000000e08369e4 function=tick_sched_timer expires=317880348871 softexpires=317880348871 mode=0xa

But how ?!? Why does sched_clock() affect the tick_sched_timer() offset?

/me reads on

> instead of 0 we have "348871" here. I consider this as a bug given that
> this important.
>
> Then PeterZ had other problems and committed
> 776f22913b8e5 ("sched/clock: Make local_clock() noinstr")
>
> which is part of v6.3-rc1. The important part here is (again) the last
> hunk, the local_clock() implementation:
>
> |noinstr u64 local_clock(void)
> |{
> | u64 clock;
> |
> | if (static_branch_likely(&__sched_clock_stable))
> | return sched_clock() + __sched_clock_offset;
> |
> | preempt_disable_notrace();
> | clock = sched_clock_local(this_scd());
> | preempt_enable_notrace();
> |
> | return clock;
> |}

And this morning I found:

https://lore.kernel.org/all/20230406040847.569970-1-dev@xxxxxxxxxx/T/#u

that 'fixes' that :-)

> The tick_sched_timer story is that it uses ktime_get() to set the

But but but, ktime_get() does not use sched_clock(), it has it's own TSC
reader.

> initial expire time and then always incremented. If the returned value
> is jiffy-based then it will work as expected. If it has higher precision
> then the time has to be adjusted since it is not expected.
>
> So, what do we do?
> - We would need to provide stable patches for v4.19…v6.1 to address the
> timer issue.
>
> - The high precision timestamps during boot are gone. Is it important to
> bring that back? Or could that new local_clock() implementation be
> backported?
>
> - An alternative is to ensure that the time returned by ktime_get() in
> tick_setup_device() is aligned with zeros.

The last one I think, let me try and figure out how sched_clock()
behaviour makes a different at all -- this is not making sense.