Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

From: NeilBrown
Date: Tue Apr 04 2017 - 21:44:45 EST


On Tue, Apr 04 2017, J. Bruce Fields wrote:

> On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
>> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
>> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> > > > Because if above is acceptable we could make reported i_version to be a sum
>> > > > of "superblock crash counter" and "inode i_version". We increment
>> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
>> > > > That way after a crash we are guaranteed each inode will report new
>> > > > i_version (the sum would probably have to look like "superblock crash
>> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
>> > > > i_version numbers we gave away but did not write to disk but still...).
>> > > > Thoughts?
>> >
>> > How hard is this for filesystems to support? Do they need an on-disk
>> > format change to keep track of the crash counter? Maybe not, maybe the
>> > high bits of the i_version counters are all they need.
>> >
>>
>> Yeah, I imagine we'd need a on-disk change for this unless there's
>> something already present that we could use in place of a crash counter.
>
> We could consider using the current time instead. So, put the current
> time (or time of last boot, or this inode's ctime, or something) in the
> high bits of the change attribute, and keep the low bits as a counter.

This is a very different proposal.
I don't think Jan was suggesting that the i_version be split into two
bit fields, one the change-counter and one the crash-counter.
Rather, the crash-counter was multiplied by a large-number and added to
the change-counter with the expectation that while not every
change-counter value landed on disk, at least 1 in every large-number would.
So after each crash we effectively add large-number to the
change-counter, and can be sure that number hasn't been used already.
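
To make the arithmetic concrete, here is a minimal sketch in Python (not
kernel code). The names LARGE and reported_version and the sample numbers
are purely illustrative; 65536 is just the multiplier from Jan's example,
and the sketch assumes fewer than LARGE increments can be lost in one crash.

```python
# Illustrative multiplier ("large-number"); taken from Jan's example.
LARGE = 65536


def reported_version(crash_counter: int, inode_version: int) -> int:
    """Value shown to clients: crash counter scaled up, plus the change counter."""
    return crash_counter * LARGE + inode_version


# Before the crash: the disk holds 100, but memory has advanced to 105
# and that value has already been handed out to a client.
crash_counter = 0
on_disk = 100
in_memory = 105
handed_out = reported_version(crash_counter, in_memory)

# Unclean shutdown: the in-memory value is lost and the superblock
# crash counter is incremented when the crash is detected.
crash_counter += 1
after_crash = reported_version(crash_counter, on_disk)

print(handed_out, after_crash)
```

The value reported after the crash is larger than anything handed out
before it, even though the on-disk counter went backwards.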

To store the crash-counter in each inode (which does appeal) you would
need to be able to remove it before adding the new crash counter, and
that requires bit-fields. Maybe there are enough bits.
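
As a rough illustration of why bit-fields make the replacement easy, here
is a sketch with a hypothetical 16/48 split of a 64-bit value. The split,
the masks, and the function names are assumptions for the example only;
nobody in this thread has proposed specific widths.

```python
# Hypothetical layout: high 16 bits hold the crash counter,
# low 48 bits hold the per-inode change counter.
COUNTER_BITS = 48
COUNTER_MASK = (1 << COUNTER_BITS) - 1
CRASH_MASK = (1 << 16) - 1


def pack(crash: int, counter: int) -> int:
    """Combine crash counter and change counter into one 64-bit value."""
    return ((crash & CRASH_MASK) << COUNTER_BITS) | (counter & COUNTER_MASK)


def restamp(i_version: int, new_crash: int) -> int:
    """Drop the old crash counter and store the new one, keeping the low bits."""
    return pack(new_crash, i_version & COUNTER_MASK)


old = pack(3, 1000)
new = restamp(old, 4)
print(new >> COUNTER_BITS, new & COUNTER_MASK)
```

With a plain sum there is no way to tell how much of the stored value came
from the old crash counter, which is why the masking is needed.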

If you want to ensure read-only files can remain cached over a crash,
then you would have to mark a file in some way on stable storage
*before* allowing any change.
e.g. you could use the lsb. Odd i_versions might have been changed
recently and crash-count*large-number needs to be added.
Even i_versions have not been changed recently and nothing need be
added.

If you want to change a file with an even i_version, you subtract
crash-count*large-number
from the i_version, then set lsb. This is written to stable storage before
the change.

If a file has not been changed for a while, you can add
crash-count*large-number
and clear lsb.

The lsb of the i_version would be for internal use only. It would not
be visible outside the filesystem.
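
A small Python model of the odd/even scheme, to check that the pieces fit
together. Everything here is an assumption made for illustration: the
function names, keeping the flag in bit 0 with the counter in the remaining
63 bits, and the modular (wrapping) arithmetic used so that the subtraction
cannot go negative.

```python
LARGE = 65536                      # illustrative "large-number"
VALUE_MASK = (1 << 63) - 1         # counter lives above the lsb flag


def reported(raw: int, crash_count: int) -> int:
    """Externally visible value; the lsb itself is never exposed."""
    v = raw >> 1
    if raw & 1:                    # odd: may have changed recently
        v = (v + crash_count * LARGE) & VALUE_MASK
    return v


def begin_change(raw: int, crash_count: int) -> int:
    """Even -> odd. The result must reach stable storage before the change."""
    if raw & 1:
        return raw
    v = ((raw >> 1) - crash_count * LARGE) & VALUE_MASK
    return (v << 1) | 1


def bump(raw: int) -> int:
    """Increment the counter of an inode that is already odd."""
    if not raw & 1:
        raise ValueError("inode must be switched to odd before changing")
    return ((((raw >> 1) + 1) & VALUE_MASK) << 1) | 1


def settle(raw: int, crash_count: int) -> int:
    """Odd -> even, once the file has been quiet for a while."""
    if not raw & 1:
        return raw
    return (((raw >> 1) + crash_count * LARGE) & VALUE_MASK) << 1


clean = 1000 << 1                      # even inode, reported value 1000
dirty = bump(begin_change(clean, 2))   # modified while crash-count is 2
print(reported(clean, 3), reported(dirty, 3))
```

In this model a crash (crash-count going from 2 to 3) leaves the even
inode's reported value untouched, so it can stay cached, while the odd
inode jumps forward by large-number.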

It feels a bit clunky, but I think it would work and is the best
combination of Jan's idea and your requirement.
The biggest cost would be switching to 'odd' before any changes, and the
unknown is when it makes sense to switch back to 'even'.

NeilBrown
