Re: [Y2038] [RFC 02/15] vfs: Change all structures to support 64 bit time

From: Dave Chinner
Date: Thu Jan 14 2016 - 16:00:12 EST


On Thu, Jan 14, 2016 at 05:53:21PM +0100, Arnd Bergmann wrote:
> On Thursday 14 January 2016 08:04:36 Dave Chinner wrote:
> > On Wed, Jan 13, 2016 at 08:33:16AM -0800, Deepa Dinamani wrote:
> > > On Tue, Jan 12, 2016 at 07:29:57PM +1100, Dave Chinner wrote:
> > > > On Mon, Jan 11, 2016 at 09:42:36PM -0800, Deepa Dinamani wrote:
> > > > > > On Jan 11, 2016, at 04:33, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > >> On Wed, Jan 06, 2016 at 09:35:59PM -0800, Deepa Dinamani wrote:
> > >
> > > 2. How to achieve a seamless transition?
> > > Is inode_timespec solution agreed upon to achieve 1a?
> >
> > No. Just convert direct to timespec64.
>
> The hard part here is how to split that change into logical patches
> per file system. We have already discussed all sorts of ways to
> do that, but there is no ideal solution, as you usually end up
> either having some really large patches, or you have to modify
> the same lines multiple times.
>
> The most promising approaches are:
>
> a) In Deepa's current patch set, some infrastructure is first
> introduced by changing the type from timespec to an identical
> inode_timespec, which lets us convert one file system at a time
> to inode_timespec and then change the type once they are all
> done. The downside is then that all file systems have to get
> touched twice so we end up with timespec64 everywhere.

Yes, and the result is not pretty.

> b) A variation of that which I would do is to use have a smaller
> set of infrastructure first, so we can change one file system
> at a time to timespec64 while leaving the common structures to
> use timespec until all file systems are converted. The downside
> is the use of some conversion macros when accessing the times
> in the inode.
> When the common code is changed, those accessor macros get
> turned into trivial assignments that can be removed up later
> or changed in the same patch.

Doesn't really make a lot of sense to me. We have to change
everything evntually, and it's not that much work to do so up
front...

> c) The opposite direction from b) is to first change the common
> code, but then any direct assignment between a timespec in
> a file system and the timespec64 in the inode/iattr/kstat/etc
> first needs a conversion helper so we can build cleanly,
> and then we do one file system at a time to remove them all
> again while changing the internal structures in the
> file system from timespec to timespec64.

No new helpers are necessary - we've already got the helper
functions we need. This:

int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
{
struct inode *inode = d_inode(old_dentry);
+ struct inode_timespec now = current_fs_time(inode->i_sb);

- inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+ VFS_INODE_SET_XTIME(i_ctime, inode, now);
+ VFS_INODE_SET_XTIME(i_mtime, dir, now);
+ VFS_INODE_SET_XTIME(i_ctime, dir, now);
inc_nlink(inode);
.....

is just wrong. All the type conversion and clamping and checking
done in that VFS_INODE_SET_XTIME() should be done in
current_fs_time() and have it return a timespec64 directly. Indeed,
it already does truncation, and can easily be made to do range
clamping, too. i.e. the change should simply be:

- inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+ inode->i_ctime = dir->i_ctime = dir->i_mtime = current_fs_time(inode->i_sb);

This will work cleanly with a s/timespec/timespec64/ API changeover,
too, without the need for any additional helpers at all.

i.e. the first thing to do is make sure everything that
creates/converts timestamps uses/returns a struct timespec that is
clamped and truncated at the earliest possible opportunity. i.e. on
entry to the kernel and when pulled from the kernel time. Once all
the code is using and passing around truncated+clamped struct
timespec (like above), then we can simply do a
s/timespec/timespec64/ on the inode and API, and we're done.

> > I think it's a mix - if the timestamps come in from userspace,
> > fail with ERANGE. That could be controlled by sysctl via VFS
> > part of the ->setattr operation, or in each of the individual FS
> > implementations. If they come from the kernel (e.g. atime update) then
> > the generic behvaiour is to warn and continue, filesystems can
> > otherwise select their own policy for kernel updates via
> > ->update_time.
>
> I'd prefer not to have it done by the individual file system
> implementation, so we get a consistent behavior.

Can't be helped, because different filesystems have different
timestamp behaviours, and not all use the generic VFS code for
timestamp updates. The filesystems need to use the correct helper
functions to obtain a valid current time, but you can't stop them
from storing and using arbitrary timestamp formats if they so
desire...

> > > d. disable expired fs at compile time (Arnd's suggestion)
> >
> > Not really an option, because it means we can't use filesystems that
> > interop with other systems (e.g. cameras, etc) because they won't
> > support y2038k timestamps for a long time, if ever (e.g. vfat).
>
> Let me clarify what my idea is here: I want a global kernel option
> that disables all code that has known y2038 issues. If anyone tries
> to build an embedded system with support beyond 2038, that should
> disable all of those things, including file systems, drivers and
> system calls, so we can reasonably assume that everything that works
> today with that kernel build will keep working in the future and
> not break in random ways.

It's not that black and white when it comes to filesystems. y2038k
support is determined by the on-disk structure of the filesystem
being mounted, and that is determined at mount time. When the
filesystem mounts and sets it's valid timestamp ranges the VFS
will need to decide as to whether the filesystem is allowed to
continue mounting or not.

> For a file system, this can be done in a number of ways:
>
> * Most file systems today interpret the time as an unsigned 32-bit
> number (as opposed to signed as ext3, xfs and few others do),
> so as long as we use timespec64 in the syscalls, we are ok.

Actually, sign conversion is a problem we currently have to be very
careful of. See, for example, xfstests:tests/generic/258, which
tests timestamps recording times before epoch. i.e. in XFS we have
to convert the unsigned 32 bit disk timestamp to signed 32 bit
before storing it in the VFS timestamp so it behaves correctly on 64
bit systems. This results in us needing to do this when reading the
inode from disk:

+ /*
+ * time is signed, so need to convert to signed 32 bit before
+ * storing in inode timestamp which may be 64 bit. Otherwise
+ * a time before epoch is converted to a time long after epoch
+ * on 64 bit systems.
+ */
+ inode->i_atime.tv_sec = (int)be32_to_cpu(from->di_atime.t_sec);
+ inode->i_atime.tv_nsec = (int)be32_to_cpu(from->di_atime.t_nsec);
+ inode->i_mtime.tv_sec = (int)be32_to_cpu(from->di_mtime.t_sec);
+ inode->i_mtime.tv_nsec = (int)be32_to_cpu(from->di_mtime.t_nsec);
+ inode->i_ctime.tv_sec = (int)be32_to_cpu(from->di_ctime.t_sec);
+ inode->i_ctime.tv_nsec = (int)be32_to_cpu(from->di_ctime.t_nsec);
+
(http://oss.sgi.com/archives/xfs/2016-01/msg00456.html)

> * Some legacy file systems (maybe hfs) can remain disabled, as
> nobody cares about them any more.
>
> * If we still care about them (e.g. ext2), we can make them support
> only read-only mode. In ext4, this would mean forbidding write
> access to file systems that don't have the extended inode format
> enabled.

For ext2/4, that would have to be handled internally by the
filesystem with feature masks. For other legacy filesystems, then
the VFS mount time checking could allow RO mounts if the supported
ranges are not y2038k clean. Compile time options are not
really the best approach here...

> > If there are places where filesystems are receiving or using
> > unchecked timestamps then those are bugs that need fixing. Those
> > need to be in separate patches to y2038k support...
>
> Fair enough, but that probably means that patch series will have to
> come first.

Yes, that is normal practice for structuring a non-trivial feature
addition patch series. Bug fixes first, then cleanups, then the
functionality changes.

> This will also reduce the number of places in which a
> separate type conversion function needs to be added.

Precisely. I'm pretty sure this should come down to "no new
conversion functions needed" for the vast majority of the code.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx