Re: [PATCH v3] Add kernel logs when sched clock unstable and NO_HZ_FULL is not possible

From: Ani Sinha
Date: Sat Jun 26 2021 - 12:19:07 EST




On Tue, 22 Jun 2021, Thomas Gleixner wrote:

> On Sun, Jun 20 2021 at 15:43, Ani Sinha wrote:
>
> > Commit 4f49b90abb4aca ("sched-clock: Migrate to use new tick
> > dependency mask model") had also removed the kernel warning
> > message informing the user that it was not possible to turn
> > on NO_HZ_FULL. Adding back that log message here. It is
> > unhelpful when the kernel turns off NO_HZ_FULL silently
> > without informing anyone.
> > Also added a kernel log when sched clock is marked as unstable.
>
> Don't do two things at once. See Documentation/process/....

Ok I will split up this patch into two.

>
> Also your subject line want's a proper prefix.

OK will fix.

>
> > diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> > index c2b2859ddd82..9f9fe658f8a5 100644
> > --- a/kernel/sched/clock.c
> > +++ b/kernel/sched/clock.c
> > @@ -192,8 +192,11 @@ void clear_sched_clock_stable(void)
> >
> > smp_mb(); /* matches sched_clock_init_late() */
> >
> > - if (static_key_count(&sched_clock_running.key) == 2)
> > + if (static_key_count(&sched_clock_running.key) == 2) {
> > + WARN_ONCE(sched_clock_stable(),
> > + "sched clock is now marked unstable.");
>
> What's the WARN for here? That backtrace is largely uninteresting.

OK to be consistent with other parts of the code, I will replace this with
pr_warn()

>
> > - if (can_stop_full_tick(cpu, ts))
> > + if (can_stop_full_tick(cpu, ts)) {
> > tick_nohz_stop_sched_tick(ts, cpu);
> > - else if (ts->tick_stopped)
> > - tick_nohz_restart_sched_tick(ts, ktime_get());
> > + } else {
> > + /*
> > + * Don't allow the user to think they can get
> > + * full NO_HZ with this machine.
> > + */
> > + WARN_ONCE(tick_nohz_full_running,
> > + "NO_HZ_FULL will not work for the current system.");
>
> can_stop_full_tick() returning false can be transient and then the user
> still has no idea _why_ this is printed.
>
> Also assume the user/admin starts perf and knows he's going to disturb
> NOHZ full, then _why_ would he be interested in that warning.
>

If NOHZ is disabled intentionally, that is not an interesting case. I am
worried about the situation where the user specifies NOHZ option in the
kernel commandline and the kernel silently disabled this because of one or
more limitations in the system. I want to address this. All the reasons
specified in the following commit is still true:

commit e12d0271774fea9fddf1e2a7952a0bffb2ee8e8b
Author: Steven Rostedt <rostedt@xxxxxxxxxxx>
Date: Fri May 10 17:12:28 2013 -0400

nohz: Warn if the machine can not perform nohz_full

If the user configures NO_HZ_FULL and defines nohz_full=XXX on the
kernel command line, or enables NO_HZ_FULL_ALL, but nohz fails
due to the machine having a unstable clock, warn about it.

We do not want users thinking that they are getting the benefit
of nohz when their machine can not support it.

Signed-off-by: Steven Rostedt <rostedt@xxxxxxxxxxx>
Cc: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: H. Peter Anvin <hpa@xxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: Li Zhong <zhong@xxxxxxxxxxxxxxxxxx>
Signed-off-by: Frederic Weisbecker <fweisbec@xxxxxxxxx>

This log was removed from the kernel in commit

commit 4f49b90abb4aca6fe677c95fc352fd0674d489bd
Author: Frederic Weisbecker <fweisbec@xxxxxxxxx>
Date: Wed Jul 22 17:03:52 2015 +0200

sched-clock: Migrate to use new tick dependency mask model

Instead of checking sched_clock_stable from the nohz subsystem to
verify
its tick dependency, migrate it to the new mask in order to include it
to the all-in-one check.


and it seems to me that the removal of the log defeats the purpose that
commit e12d0271774fea9fddf tried to serve.

Please enlighten me as to how to achieve this and I will cook up
something.

Ani