Re: [Y2038] [RFC 02/15] vfs: Change all structures to support 64 bit time

From: Arnd Bergmann
Date: Thu Jan 14 2016 - 17:46:26 EST


On Friday 15 January 2016 08:00:01 Dave Chinner wrote:
> On Thu, Jan 14, 2016 at 05:53:21PM +0100, Arnd Bergmann wrote:
> > On Thursday 14 January 2016 08:04:36 Dave Chinner wrote:
> > > On Wed, Jan 13, 2016 at 08:33:16AM -0800, Deepa Dinamani wrote:
> > c) The opposite direction from b) is to first change the common
> > code, but then any direct assignment between a timespec in
> > a file system and the timespec64 in the inode/iattr/kstat/etc
> > first needs a conversion helper so we can build cleanly,
> > and then we do one file system at a time to remove them all
> > again while changing the internal structures in the
> > file system from timespec to timespec64.
>
> No new helpers are necessary - we've already got the helper
> functions we need. This:
>
> int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
> {
> struct inode *inode = d_inode(old_dentry);
> + struct inode_timespec now = current_fs_time(inode->i_sb);
>
> - inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
> + VFS_INODE_SET_XTIME(i_ctime, inode, now);
> + VFS_INODE_SET_XTIME(i_mtime, dir, now);
> + VFS_INODE_SET_XTIME(i_ctime, dir, now);
> inc_nlink(inode);
> .....
>
> is just wrong. All the type conversion and clamping and checking
> done in that VFS_INODE_SET_XTIME() should be done in
> current_fs_time() and have it return a timespec64 directly. Indeed,
> it already does truncation, and can easily be made to do range
> clamping, too. i.e. the change should simply be:
>
> - inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
> + inode->i_ctime = dir->i_ctime = dir->i_mtime = current_fs_time(inode->i_sb);


Yes, that is the obvious case, and I guess works for at least half the
file systems when they always assign righthand side and lefthand
side of the time stamps using the external types or helpers like
CURRENT_TIME and current_fs_time().

However, there are a couple of file systems that need a bit more refactoring
before we can do this, e.g. in ntfs_truncate:

if (!IS_NOCMTIME(VFS_I(base_ni)) && !IS_RDONLY(VFS_I(base_ni))) {
struct timespec now = current_fs_time(VFS_I(base_ni)->i_sb);
int sync_it = 0;

if (!timespec_equal(&VFS_I(base_ni)->i_mtime, &now) ||
!timespec_equal(&VFS_I(base_ni)->i_ctime, &now))
sync_it = 1;
VFS_I(base_ni)->i_mtime = now;
VFS_I(base_ni)->i_ctime = now;
}

The type of the local variable must match the return code of
current_fs_time(), so if we change over i_mtime and current_fs_time
globally, this either has to be rewritten first to avoid the use of
local variables, or it needs temporary conversion helpers, or
it has to be changed in the same patch. None of those is particularly
appealing. There are a few dozen such things in various file systems.

> > > I think it's a mix - if the timestamps come in from userspace,
> > > fail with ERANGE. That could be controlled by sysctl via VFS
> > > part of the ->setattr operation, or in each of the individual FS
> > > implementations. If they come from the kernel (e.g. atime update) then
> > > the generic behvaiour is to warn and continue, filesystems can
> > > otherwise select their own policy for kernel updates via
> > > ->update_time.
> >
> > I'd prefer not to have it done by the individual file system
> > implementation, so we get a consistent behavior.
>
> Can't be helped, because different filesystems have different
> timestamp behaviours, and not all use the generic VFS code for
> timestamp updates. The filesystems need to use the correct helper
> functions to obtain a valid current time, but you can't stop them
> from storing and using arbitrary timestamp formats if they so
> desire...

I mean the decision whether to clamp or error on an overflow
should be done consistently. Having a global sysctl knob or
a compile-time option is better than having each file system
implementor take a guess at what users might prefer, if we can't
come up with a behavior (e.g. clamp all the time, or error out
all the time) that everybody agrees is always correct.

> > > > d. disable expired fs at compile time (Arnd's suggestion)
> > >
> > > Not really an option, because it means we can't use filesystems that
> > > interop with other systems (e.g. cameras, etc) because they won't
> > > support y2038k timestamps for a long time, if ever (e.g. vfat).
> >
> > Let me clarify what my idea is here: I want a global kernel option
> > that disables all code that has known y2038 issues. If anyone tries
> > to build an embedded system with support beyond 2038, that should
> > disable all of those things, including file systems, drivers and
> > system calls, so we can reasonably assume that everything that works
> > today with that kernel build will keep working in the future and
> > not break in random ways.
>
> It's not that black and white when it comes to filesystems. y2038k
> support is determined by the on-disk structure of the filesystem
> being mounted, and that is determined at mount time. When the
> filesystem mounts and sets it's valid timestamp ranges the VFS
> will need to decide as to whether the filesystem is allowed to
> continue mounting or not.

Some file systems are always broken around 2038 (e.g. HFS in 2040),
so if we can't fix them, I want to be able to turn them off
in Kconfig along with the 32-bit time_t syscalls. For those
where y2038 support depends on on-disk feature flags (ext4 and
xfs I guess, maybe one or two more), the config option can
turn off write support for the old format.

This is a rather important part of the y2038 work: If anyone cares about
y2038 problems in this decade, they want to deploy systems with extremely
long service contracts and don't want to update them 20 years from now
to fix an obscure bug that could be prevented today, so we try hard to
identify every line of code that won't work then as it does today.

> > For a file system, this can be done in a number of ways:
> >
> > * Most file systems today interpret the time as an unsigned 32-bit
> > number (as opposed to signed as ext3, xfs and few others do),
> > so as long as we use timespec64 in the syscalls, we are ok.
>
> Actually, sign conversion is a problem we currently have to be very
> careful of. See, for example, xfstests:tests/generic/258, which
> tests timestamps recording times before epoch. i.e. in XFS we have
> to convert the unsigned 32 bit disk timestamp to signed 32 bit
> before storing it in the VFS timestamp so it behaves correctly on 64
> bit systems. This results in us needing to do this when reading the
> inode from disk:
>
> + /*
> + * time is signed, so need to convert to signed 32 bit before
> + * storing in inode timestamp which may be 64 bit. Otherwise
> + * a time before epoch is converted to a time long after epoch
> + * on 64 bit systems.
> + */
> + inode->i_atime.tv_sec = (int)be32_to_cpu(from->di_atime.t_sec);
> + inode->i_atime.tv_nsec = (int)be32_to_cpu(from->di_atime.t_nsec);
> + inode->i_mtime.tv_sec = (int)be32_to_cpu(from->di_mtime.t_sec);
> + inode->i_mtime.tv_nsec = (int)be32_to_cpu(from->di_mtime.t_nsec);
> + inode->i_ctime.tv_sec = (int)be32_to_cpu(from->di_ctime.t_sec);
> + inode->i_ctime.tv_nsec = (int)be32_to_cpu(from->di_ctime.t_nsec);
> +
> (http://oss.sgi.com/archives/xfs/2016-01/msg00456.html)

I'm very aware of this issue. Most file system developers however were
not, so e.g. a timestamp on btrfs is interpreted differently on 32-bit
and 64-bit kernels. Some file systems (AFS, NFSv3?) explicitly define
the timestamps as unsigned, and most others that store 32-bit seconds
apparently never thought about the issue and happen to use unsigned
interpretation on 64-bit systems, since that is what you get out of
be32_to_cpu() without adding the cast.

For the y2038 case, this means we are lucky: almost all the users today
have 64-bit hardware, so they can already represent timestamps in the
range from 1970 to 2106. Once we have 64-bit timestamps in 32-bit user
space, we just need to make that use the same format we use on the
64-bit machines already.

ext2/3/4, xfs and ocfs2 (maybe one or two more, I'd have to check)
currently behave in a consistent manner across 32-bit and 64-bit
architectures by allowing a range between 1902 and 2037, and we
obviously don't have a choice there but to keep that current
behavior, and extend the time format in one way or another to
store additional bits for the epoch.

> > * Some legacy file systems (maybe hfs) can remain disabled, as
> > nobody cares about them any more.
> >
> > * If we still care about them (e.g. ext2), we can make them support
> > only read-only mode. In ext4, this would mean forbidding write
> > access to file systems that don't have the extended inode format
> > enabled.
>
> For ext2/4, that would have to be handled internally by the
> filesystem with feature masks. For other legacy filesystems, then
> the VFS mount time checking could allow RO mounts if the supported
> ranges are not y2038k clean. Compile time options are not
> really the best approach here...

I'm not following the line of thought here. We have some users
that want ext4 to mount old file system images without long
inodes writable, because they don't care about the 2038 problem.
We also have other users that want to force the same file system
image to be read-only because they want to ensure that it does
not stop working correctly when the time overflow happens while
the fs is mounted.

If you don't want a compile-time option for it, how do you suggest
we decide which case we have?

Arnd