Re: [GIT pull] printk updates for 4.15

From: Linus Torvalds
Date: Tue Nov 14 2017 - 12:21:05 EST


On Tue, Nov 14, 2017 at 2:03 AM, Petr Mladek <pmladek@xxxxxxxx> wrote:
>
> AFAIK, local clock is not synchronous between different machines
> and even CPUs on the same machine. It was used in printk() because
> it was lockless. Therefore it is kind of random itself.

It's not that it would be synchronous between machines, but that it
would behave the same way.

Lookie here, the issue is this:

static ssize_t msg_print_ext_header(..
{
u64 ts_usec = msg->ts_nsec;

do_div(ts_usec, 1000);

return scnprintf(buf, size, "%u,%llu,%llu,%c;",
(msg->facility << 3) | msg->level, seq, ts_usec,
msg->flags & LOG_CONT ? 'c' : '-');
}

this is an _exported_ interface to user space.

Is it an odd interface? Yes it is. It was "designed" by the systemd
people. See commit e11fea92e13f ("kmsg: export printk records to the
/dev/kmsg interface").

But that really means that you can't just change what "ts_nsec" means
- it's exported as "microseconds since boot".

And yes, at least systemd actually uses it. I cloned the git
repository and checked. It does something like

unsigned long long usec;
...
r = safe_atollu(p, &usec);
...
if (asprintf(&source_time, "_SOURCE_MONOTONIC_TIMESTAMP=%llu",
usec) >= 0)
iovec[n++] = IOVEC_MAKE_STRING(source_time);

when reading /dev/kmsg, and then elsewhere it does

PARSE_FIELD_VEC_ENTRY("_SOURCE_MONOTONIC_TIMESTAMP=",
&monotonic, &monotonic_len),

if (mode == OUTPUT_SHORT_MONOTONIC)
r = output_timestamp_monotonic(f, j, monotonic);

Now, the important thing to keep in mind is that the kernel is not the
_only_ source of these "_SOURCE_MONOTONIC_TIMESTAMP" things.

In other places, systemd generates them itself, with

ts->monotonic = now(CLOCK_MONOTONIC);

where "now()" ends up using clock_gettime().

End result? The kernel /dev/kmsg interface has to be at least
"similar" to returning usec's comparable to CLOCK_MONOTONIC.

You can't make it just do realtime. That's a whole different clock.

Now, admittedly "local_clock()" (aka sched_clock()) that printk uses
now is not great either, but at least it has the same basic time base
as CLOCK_MONOTONIC, and isn't off by 47 years.

So *that* is my worry. I don't expect clocks across machines to have
the same value. But I expect them to have the same *behavior*, and not
see really odd system journal behavior just because one machine ends
up using wall clock time, and one ends up using a monotonic clock, and
one ends up using the traditional jiffies or whatever.

And honestly, if we decide to change the time format, the absolute
*LAST* thing I want to see is to make this a config option.

Either we do it or we don't. Not this insane "you can choose between
five different choices, I leave it up to you to screw up your
machine".

Seriously. This whole "we can't even make a choice, so we let the user
screw with it" is pure and utter garbage.

If you as a kernel developer cannot make a choice., and argue strongly
for _why_ that choice is the right one to export to user space, then
we do not change existing behavior, since clearly you don't really
have a good reason for the change.

This is why I would suggest just keeping the existing clock, and then
adding clock events at _significant_ points. Somebody brought up the
complete red herring of "every five minutes will fiill up the system
log". That's complete BS. You do it when people change the system
clock (and perhaps when NTP synchronizes), or after a suspend event,
or maybe regularly every few _hours_ (and make that last part
configurable).

Filling up the system log is simply not an issue.

Linus