Re: [PATCH RFC] fs: turn inode->i_ctime into a ktime_t

From: Theodore Ts'o
Date: Sun May 26 2024 - 11:32:28 EST


On Sun, May 26, 2024 at 08:20:16AM -0400, Jeff Layton wrote:
>
> Switch the __i_ctime fields to a single ktime_t. Move the i_generation
> down above i_fsnotify_mask and then move the i_version into the
> resulting 8 byte hole. This shrinks struct inode by 8 bytes total, and
> should improve the cache footprint as the i_version and __i_ctime are
> usually updated together.

So first of all, this patch is a bit confusing because the patch
doesn't change __i_ctime, but rather i_ctime_sec and i_ctime_nsec, and
Linus's tree doesn't have those fields. That's because the base
commit in the patch, a6f48ee9b741, isn't in Linus's tree, and
apparently this patch is dependent on "fs: switch timespec64 fields in
inode to discrete integers"[1].

[1] https://lore.kernel.org/all/20240517-amtime-v1-1-7b804ca4be8f@xxxxxxxxxx/

> The one downside I can see to switching to a ktime_t is that if someone
> has a filesystem with files on it that has ctimes outside the ktime_t
> range (before ~1678 AD or after ~2262 AD), we won't be able to display
> them properly in stat() without some special treatment. I'm operating
> under the assumption that this is not a practical problem.

There are two additional possible problems with this. The first is
that a maliciously fuzzed file system with ctimes outside of the
ktime_t range will almost certainly trigger UBSAN warnings, which
will result in Syzkaller security bugs getting reported to file
system developers. This can be fixed by explicitly clamping the
range whenever converting to ktime_t in include/linux/fs.h (see the
sketch below), but that leads to another problem.
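
The clamping I have in mind would be something like this (the helper
name is my invention; KTIME_SEC_MAX/MIN and ktime_set() are the
existing <linux/ktime.h> primitives):

	static inline ktime_t inode_clamped_ktime(struct timespec64 ts)
	{
		/*
		 * ktime_set() already saturates at KTIME_SEC_MAX; the
		 * lower clamp keeps a fuzzed pre-1678 timestamp from
		 * overflowing the secs * NSEC_PER_SEC multiply.
		 */
		if (ts.tv_sec >= KTIME_SEC_MAX)
			return KTIME_MAX;
		if (ts.tv_sec <= KTIME_SEC_MIN)
			return KTIME_MIN;
		return ktime_set(ts.tv_sec, ts.tv_nsec);
	}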

The second problem is that if the file system converts its on-disk
inode to the in-memory struct inode, and then converts the in-memory
form back to the on-disk inode format, and the timestamp is outside
of the ktime_t range, this could result in the on-disk inode having
its ctime field corrupted. Now, *most* of the time, whenever the
inode needs to be written back to disk, the ctime field will have
been changed anyway. However, there are changes that don't result in
userspace-visible changes, but involve internal file system changes
(for example, in the case of an online repair or defrag, or a COW
snapshot), where the file system doesn't set ctime --- and in that
case it's possible that the ctime field gets silently mangled.
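
To make the hazard concrete (the helper names here are made up, and
which field the patch ends up using doesn't matter for the point):

	/* read: an on-disk ctime beyond the ktime_t range gets clamped */
	inode->__i_ctime = inode_clamped_ktime(raw_ctime);

	/* ... an online defrag rewrites the inode, never touching ctime ... */

	/* writeback: the clamped value, not the original, goes to disk */
	raw_ctime = ktime_to_timespec64(inode->__i_ctime);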

We could argue that ctime fields outside of the ktime_t range should
never, ever happen (except under malicious fuzz testing by systems
like syzkaller), and so we don't care(tm), but it's worth at least a
mention in the comments and commit description. Of course, a file
system which *did* care could work around the problem by keeping its
own copy of ctime in the file-system-specific inode, but this would
come at the cost of the space-saving benefits of this commit.

I suspect what I'd do is add a manual check for an out-of-range
ctime on-disk, log a warning, and then clamp the ctime to the
maximum ktime_t value, which is what would be returned by stat(2),
and then write that clamped value back to disk if the ctime field
doesn't get set to the current time before the inode needs to be
written back.
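
As a rough sketch (the function name and warning text are invented,
and I'm clamping to the nearest bound rather than always the max):

	static inline ktime_t fs_ctime_from_disk(struct super_block *sb,
						 unsigned long ino,
						 struct timespec64 ts)
	{
		if (unlikely(ts.tv_sec > KTIME_SEC_MAX ||
			     ts.tv_sec < KTIME_SEC_MIN)) {
			pr_warn_ratelimited("%s: inode %lu: on-disk ctime out of ktime_t range, clamping\n",
					    sb->s_id, ino);
			/* caller should mark the inode dirty so the
			 * clamped value eventually gets written back */
			return ts.tv_sec > KTIME_SEC_MAX ? KTIME_MAX
							 : KTIME_MIN;
		}
		return ktime_set(ts.tv_sec, ts.tv_nsec);
	}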

Cheers,

- Ted