Re: [GIT PULL v2] timestamp fixes

From: Linus Torvalds
Date: Sat Sep 23 2023 - 13:49:18 EST


On Fri, 22 Sept 2023 at 23:36, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> Apparently, they are willing to handle the "year 2486" issue ;)

Well, we could certainly do the same at the VFS layer.

But I suspect 10ns resolution is entirely overkill, since on a lot of
platforms you don't even have timers with that resolution.

I feel like 100ns is a much more reasonable resolution, and is quite
close to a single system call (think "one thousand cycles at 10GHz").

> But the resolution change is counter to the purpose of multigrain
> timestamps - if two syscalls updated the same or two different inodes
> within a 100ns tick, apparently, there are some workloads that
> care to know about it and fs needs to store this information persistently.

Those workloads are broken garbage, and we should *not* use that kind
of sh*t to decide on VFS internals.

Honestly, if the main reason for the multigrain resolution is
something like that, I think we should forget about MG *entirely*.
Somebody needs to be told to get their act together.

We have *never* guaranteed nanosecond resolution on timestamps, and I
think we should put our foot down and say that we never will.

Partly because we have platforms where that kind of timer resolution
just does not exist.

Partly because it's stupid to expect that kind of resolution anyway.

And partly because any load that assumes that kind of resolution is
already broken.

End result: we should ABSOLUTELY NOT have as a target to support some
insane resolution.

100ns resolution for file access times is - and I'll happily go down
in history for saying this - enough for anybody.

If you need finer resolution than that, you'd better do it yourself in
user space.

And no, this is not a "but some day we'll have terahertz CPU's and
100ns is an eternity". Moore's law is dead, we're not going to see
terahertz CPUs, and people who say "but quantum" have bought into a
technological fairytale.

100ns is plenty, and has the advantage of having a very safe range.

That said, we don't have to do powers-of-ten. In fact, in many ways,
it would probably be a good idea to think of the fractional seconds in
powers of two. That tends to make it cheaper to do conversions,
without having to do a full 64-bit divide (a constant divide turns
into a fancy multiply, but it's still painful on 32-bit
architectures).

So, for example, we could easily make the format be a fixed-point
format with "sign bit, 38 bit seconds, 25 bit fractional seconds",
which gives us about 30ns resolution, and a range of almost 9000
years. Which is nice, in how it covers all of written history and all
four-digit years (we'd keep the 1970 base).

And 30ns resolution really *is* pretty much the limit of a single
system call. I could *wish* we had system calls that fast, or CPU's
that fast. Not the case right now, and sadly doesn't seem to be the
case in the forseeable future - if ever - either. It would be a really
good problem to have.

And the nice thing about that would be that conversion to timespec64
would be fairly straightforward:

struct timespec64 to_timespec(fstime_t fstime)
{
struct timespec64 res;
unsigned int frac;

frac = fstime & 0x1ffffffu;
res.tv_sec = fstime >> 25;
res.tv_nsec = frac * 1000000000ull >> 25;
return res;
}

fstime_t to_fstime(struct timespec64 a)
{
fstime_t sec = (fstime_t) a.tv_sec << 25;
unsigned frac = a.tv_nsec;

frac = ((unsigned long long) a.tv_nsec << 25) / 1000000000ull;
return sec | frac;
}

and both of those generate good code (that large divide by a constant
in to_fstime() is not great, but the compiler can turn it into a
multiply).

The above could be improved upon (nicer rounding and overflow
handling, and a few modifications to generate even nicer code), but
it's not horrendous as-is. On x86-64, to_timespec becomes a very
reasonable

movq %rdi, %rax
andl $33554431, %edi
imulq $1000000000, %rdi, %rdx
sarq $25, %rax
shrq $25, %rdx

and to some degree that's the critical function (that code would show
up in 'stat()').

Of course, I might have screwed up the above conversion functions,
they are untested garbage, but they look close enough to being in the
right ballpark.

Anyway, we really need to push back at any crazies who say "I want
nanosecond resolution, because I'm special and my mother said so".

Linus