"Chris Siebenmann wrote:"
> You write:
> | but I'd like to know if we expect a file of fixed size which is being
> | overwritten without O_TRUNC to have any metadata changes apart from in
> | its inode (and there trivially) ...?
>
> It depends on the filesystem. A 'traditional' Unix filesystem (original
> V7 or Berkeley FFS derived) will not. A journaling filesystem that is
> journaling the data will write to the log. Something like Reiserfs may
OK.
> I think I have an alternative paper design for what you want, though.
Let's go ...
> Hack up your chosen underlying filesystems to understand two additional
> mount flags: REWRITEO and EXTENDO. A filesystem mounted REWRITEO allows
> all read operations and only write operations to already allocated file
> blocks (and it does not update inode mtime when such writes happen). A
OK. RWO means "overwrite allowed".
> filesystem mounted EXTENDO allows REWRITEO operations and files to be
> extended, but no other write operations; writes under EXTENDO update
> inode metadata as they normally would.
Hmm.
> Define two new internal errnos, returned by the filesystem to mean
> 'operation requires EXTENDO mount' or 'operation requires full write
> mount'.
>
> Create an overlay pseudo-filesystem type (you could hack this into
> the VFS, but it's simpler to make it a new filesystem), and a user
> level helper for lock management. This pseudo-filesystem forwards
> VFS operations to an actual underlying filesystem, traps and handles
> operations that require changing the underlying filesystem's mount
> options, returns the results to the user with any editing they need, and
> handles lock state transitions.
Well, not a bad idea anyway to try an overlay first. That seems to
counter most objections I've heard on its own!
> IO to the underlying filesystem is done O_DIRECT, to bypass caching
> both ways. Because the overlay filesystem does the actual opens, it can
> transparently add O_DIRECT to the open flags. The overlay filesystem
Well, that was easy to do anyway. I hacked the VFS mount calls to
support a MNT_DIRECT flag and hacked sys_open to notice that flag on
the mount when it was called, and do an O_DIRECT open.
> needs to trap opens (and closes) in order to keep track of what files
> are open on the underlying filesystem; I suspect it needs to dummy up
> file objects and do some forwarding there in order to keep track of
> everything.
All we will do at the end of the day is close the file!
I don't see what needs tracking until then ...
> The underlying filesystem is normally mounted REWRITEO on all nodes.
> A single instance may be mounted EXTENDO while the others continue to
Oh, OK.
> be in REWRITEO. Full write is only allowed when no one else is using
> the filesystem at al. This is all managed by a lock manager server
> for the disk store, which talks to clients on each node using the
> particular store.
>
> When a node requests EXTENDO, the lock manager verifies that everyone
> else is in REWRITEO or tells the node to stall on that until everyone
> else is. When a node requests full write, the lock manager asks
Hmm. This can starve.
> all other nodes to temporarily unmount the filesystem and stall IO
That's because you don't have access to the dcache entries for the
underlying fs? I think one can get them in a finer grained way. One can
certainly vamoosh them all at once - there's a call for that already.
It walks the dcache and kills anything pointing to the right system.
> operations on it; when full write is released, the lock manager tells
> everyone they can go to REWRITEO and start IO again. When a node joins
> the lock manager, it asks for REWRITEO and the lock manager verifies
> that no one is in write mode at the moment before saying 'go ahead'.
Yes.
> On transitions between states, the kernel overlay filesystem closes
> down all references held (open files, etc) to the underlying filesystem,
> unmounts it (optimization: some transitions can be done by remount, for
This is because you can't get at the underlying dcache easily.
> example EXTENDO -> REWRITEO), and then when the user level lock manager
> says it's okay remounts the filesystem with the new mount. It must then
> re-obtain all the underlying filesystem inodes and file references it
> was using. There are two ways:
Really? Why? Can't we just lose our own dcace as well?
> - you can steal code from the NFS server, which only works on some
> filesystems because it assumes constant inode numbers over the
> lifetime of the filesystem. (For example, for a while it didn't
> work on Reiserfs.)
Well, I feel bound to comment that the fact that NFS didn't work
universally for a while didn't seem to stop people wanting to use it!
I'd be quite happy to make that assumption, and let RFS worry about it!
> - you pull the filenames of open files from the dentry reference
> you're holding, and reopen them. If it fails, mark the file as
Oh, I see, that's why you wanted them.
> errored-out and return ESTALE for all further IO against it.
> [Somewhat hazardous, since the filename may now point to a
> different file.]
Well, dunno.
> When an EXTENDO client drops down to REWRITEO, the disk store lock
> manager must kick all clients to revalidate the filesystem by executing
> a null state transition (REWRITEO to REWRITEO). This insures that they
> immediately see the full size of the newly extended file.
Hmm. OK. They might want to wait until they need to know, but that's
OK.
> When an operation fails because it needs EXTENDO or full write, the
> pseudo filesystem layer stalls the request and notifies the user level
> process that it needs a lock at the relevant level. The user level
> process goes off to negociate this with the server, which will call back
> to other clients as necessary and then notify this client that it can
> go ahead. When the mount is upgraded to the needed level, the operation
> goes forward.
>
> Unmounting the underlying filesystem on lock state changes means that
> you flush all metadata automatically. By having REWRITEO, we know that
> we can safely cache metadata -- no one is going to be changing it by the
> rules of the game.
I might try this.
> EXTENDO is a wart, and it may be worth eliminating it; as it is,
> some clients may see the newly allocated space even before EXTENDO is
> dropped, but some may not. This design assumes that it's okay to not
> necessarily let other people at the data until the extending client
> drops the lock.
>
> This design is inefficient if there are many full write operations;
Well, of course. The point i sthat it allows aordinary caching
normally, and then causes all caches to be dropped whenever anyone
anywhere does anything that might cause some metadatachange somwhere.
I think one can be more exact than that, but it's OK as a tryout.
> that would be creating files, creating directories, renaming files,
> removing files, etc. But if it is mostly reading and rewriting (the
> assumptions I've seen) it should go very nicely.
>
> The design does assume that the inode access and modification times
They are unimportant.
> are unimportant. I don't think you can get good performance without
> this assumption.
Correct.
> Hopefully this is clear enough.
It is. Thank you.
Peter
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
This archive was generated by hypermail 2b29 : Sat Sep 07 2002 - 22:00:32 EST