[RFC] Tux3 for review
From: Daniel Phillips
Date: Fri May 16 2014 - 20:51:11 EST
We would like to offer Tux3 for review for mainline merge. We have
prepared a new repository suitable for pulling:
https://git.kernel.org/cgit/linux/kernel/git/daniel/linux-tux3.git/
Tux3 kernel module files are here:
https://git.kernel.org/cgit/linux/kernel/git/daniel/linux-tux3.git/tree/fs/tux3
Tux3 userspace tools and tests are here:
https://git.kernel.org/cgit/linux/kernel/git/daniel/linux-tux3.git/tree/fs/tux3/user?h=user
Repository
We are moving our development to the kernel.org tree from our standalone
Github repository. Our history was imported from the standalone
repository using git am. Our kernel.org tree is the usual fork of Linus
mainline, with Tux3 kernel files on the master branch and userspace
files in fs/tux3/user on the user branch. We maintain the user files in
our kernel tree because Tux3 has a tighter coupling than usual between
userspace and kernel.
Most of our kernel code also runs in userspace, for testing or as a fuse
filesystem or as part of our userspace support. We also need to keep our
master branch clean of userspace files. These conflicting requirements
create challenges for our workflow. We can't just merge from user to
master because that would pull in userspace files to kernel, and we
can't merge from master to user because that would pull the entire
kernel history into our branch. The best idea we have come up with is to
cherry-pick changes from user to master and master to user. This creates
merge noise in our user history and requires care to avoid combining
kernel and userspace changes in the same commit. At least, this is
better than having two completely separate repositories. Probably. We
would appreciate any comment on how this workflow could be improved.
For the time being, the subtree at fs/tux3 can also be used standalone.
Run make in fs/tux3 to build a kernel module for the running kernel
version. Run make in fs/tux3/user to build userspace commands including
"tux3 mkfs". Run "make tests" in fs/tux3/user to run our unit tests.
This capability might be useful for people interested in experimenting
with Tux3 in user space, and is handy for a quick build of the user
support without needing to pull the whole repository.
The tux3 command built in fs/tux3/user provides our support tools
including "tux3 mkfs" and "tux3 fsck". For now, we do not build a
standalone mkfs.tux3 and consider that a feature, not a bug, because it
sends the message that Tux3 is for developers right now.
API changes
Tux3 does not implement any custom or extended interfaces.
Core changes
Tux3 builds without any core changes; however, we do some unnatural
things to enable that. We would like to have some core changes to clean
this up. One is a correctness issue for mmap and three others are to
clean up ugly workarounds. Without any core changes, mmap will be
disabled because there is a potential for stale cache pages with
combined file and mmap IO. I will describe the four changes here and
provide patches if asked:
1. mmap
Our "page fork" technique does copy-on-write on cache pages in order to
enforce strict delta ordering, which prevents changing pages already
under IO as a side effect. For mmap, we do the page fork in
->page_mkwrite, which needs to be able to change the target page.
Without this ability, we fault twice for each page_mkwrite, and we
cannot close all races. We also have an ugly hack to export a
page_cow_file symbol to our module without patching core.
2. Free a forked page
A forked page that goes out of scope after IO must be freed. We
currently do that in an ugly way by polling for refcount to go to zero.
3. Cgroup interaction
We need some unexported functions to support cgroup.
4. Inode flushing
To enforce strong ordering, we flush inodes in a certain order that core
knows nothing about. Allowing core to flush our inodes using its current
algorithm would cause corruption. We would like a new fs-specific hook
to call our own flushing algorithm. Without that, we replicate part of
the core flushing code to call the tux3 flusher. Code for this is in
commit_flusher.c and commit_flusher_hack.c. Alternatively we can try to
improve the core flusher to meet our needs, or do both: develop a
generic, improved flusher within Tux3 using the hook, test it a lot,
then propose it for core. We would be more than happy to join in the
active effort to improve the core flusher.
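To make the page fork idea from item 1 concrete, here is a minimal userspace sketch. It is a toy model, not the Tux3 code: struct toy_page, toy_page_fork and the delta counter are hypothetical names invented for illustration, and real page cache locking and refcounting are left out. A page that is still dirty for an earlier delta is copied, and the cache slot is switched to the copy so that the original can be written out unchanged.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for a cache page; not the kernel's struct page. */
struct toy_page {
	int dirty;                         /* nonzero if modified and not yet written */
	unsigned delta;                    /* delta in which the page was last dirtied */
	unsigned char data[TOY_PAGE_SIZE];
};

/*
 * Return the page that a writer in delta `now` may modify.
 * A clean page, or one already dirty in this delta, is used in place.
 * A page dirty for an earlier delta is forked: the cache slot is switched
 * to a copy, and the original stays untouched until its writeout ends.
 */
struct toy_page *toy_page_fork(struct toy_page **slot, unsigned now)
{
	struct toy_page *old = *slot;

	assert(old != NULL);
	if (!old->dirty || old->delta == now) {
		old->dirty = 1;
		old->delta = now;
		return old;
	}

	struct toy_page *clone = malloc(sizeof *clone);
	if (!clone)
		return NULL;
	memcpy(clone->data, old->data, TOY_PAGE_SIZE);
	clone->dirty = 1;
	clone->delta = now;
	*slot = clone;                     /* later lookups find the fork */
	return clone;                      /* caller frees `old` after its IO completes */
}
```

In this model the caller remains responsible for freeing the old page once its writeout finishes, which is the job described in item 2.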
Style
We are not perfectly checkpatch clean. We run checkpatch like this:
scripts/checkpatch.pl -f fs/tux3/*.[ch] --ignore \
PRINTF_L,C99_COMMENTS,SPLIT_STRING,SUSPECT_CODE_INDENT,LONG_LINE -q
With that, checkpatch still has a few complaints, but not too
many. Our rationale for suppressing some checkpatch complaints:
PRINTF_L: printk supports it. It is shorter and nicer to our eyes.
Checkpatch complains that it is not standard C, but it is not clear
why that matters for kernel code. If anybody cares strongly, we will
change %L to %ll.
C99_COMMENTS: We use them sparingly as a shorthand for "FIXME: <line
where fix is obviously needed>". Will go away as fixes arrive.
SPLIT_STRING: We split some strings to fit in 80 columns. If anybody
hates that, we will change them back to long lines.
SUSPECT_CODE_INDENT: False positives.
LONG_LINE: There are a few long lines, where readability would be
worse with splitting. We take our guidance from Linus:
http://yarchive.net/comp/linux/coding_style.html
If we made some line unreadable that way, please let us know and we
will fix it.
Other issues
Declarations after Statements. We have some declarations after
statements, mostly in the userspace code but also some in the kernel
code. We have -Wno-declaration-after-statement in tux3/Makefile to build
without warnings. We think that tasteful use of this C99 extension
improves our code readability and maintainability. We would prefer to
keep these if nobody objects.
Source includes. We include C files in a few places instead of linking
them, typically because it is easier to maintain that way. This
technique is already used in various places in the kernel. Can be changed if
necessary.
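For readers unfamiliar with the declarations-after-statements style discussed above, here is a small illustration. The function sum_positive is a hypothetical example written for this note, not taken from Tux3; it only shows what the style looks like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sum the positive entries of an array. */
int sum_positive(const int *v, size_t n)
{
	assert(v != NULL || n == 0);
	if (n == 0)
		return 0;

	/* C99 style: `total` is declared after a statement, where it is first needed. */
	int total = 0;
	for (size_t i = 0; i < n; i++) {
		if (v[i] <= 0)
			continue;
		int x = v[i];      /* declared mid-block, next to its only use */
		total += x;
	}
	return total;
}
```

With -Wdeclaration-after-statement enabled, a compiler such as gcc warns about both declarations; the -Wno-declaration-after-statement flag in our Makefile suppresses that warning.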
Fitness for use
Tux3 is not fit for use as of today and will eat your data. The most
glaring deficiency is that Tux3 goes BUG on ENOSPC. Some expected
interfaces are missing, like direct IO, xattrs and atime. Some
performance patches are out of tree, to be merged later. This includes
directory indexing, so directories over a few thousand files will slow
to a crawl. Tux3 survives our stress testing, but that does not mean it
will survive your stress testing.
Purpose
We think that Tux3 fills a niche in the Linux ecology where a light,
tight, modern filesystem belongs. We offer a fresh approach to some
ancient problems. Tux3's best trick is strong consistency without the
overhead that you might expect. Our obsession with minimal resource
consumption, including disk space, CPU overhead and cache memory makes
Tux3 promising for personal and embedded use. Tux3's feature set is not
enterprise grade by any stretch of the imagination, but we hope to
accrete some big system features over time. Any of several existing
Linux filesystems already do a nice job of servicing that space, so we
do not need to rush that. Tux3's special mission is to focus on basic
functionality that is really robust, fast and simple.
Quick tour
Tux3 has thirty-three C source files and thirteen header files,
comprising about 18 thousand lines. Some files are the familiar ones
from Ext2: balloc.c, dir.c, inode.c, namei.c, super.c and xattr.c.
Our btree code is a generic OOP-like btree class implemented in btree.c.
Subclasses for different btree types are provided by specialized leaf
methods in dleaf.c and ileaf.c, for file data btrees and our inode table
tree, respectively. We reuse the ileaf.c methods in orphan.c to store
orphaned inodes.
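The "generic class plus specialized leaf methods" pattern can be sketched in plain C with a table of function pointers. Everything below is a hypothetical toy (toy_btree, toy_leaf_ops and the two leaf layouts are invented for illustration and do not reflect the actual btree.c, dleaf.c or ileaf.c interfaces). The generic lookup only calls through the ops table, and each tree type supplies its own leaf format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-tree-type leaf methods; generic code only calls through this table. */
struct toy_leaf_ops {
	/* Return 0 and set *val if key is present in the leaf, else -1. */
	int (*lookup)(const void *leaf, uint64_t key, uint64_t *val);
};

struct toy_btree {
	const struct toy_leaf_ops *ops;
	const void *leaf;   /* a real tree would walk index nodes to find the leaf */
};

/* Generic entry point: knows nothing about the leaf layout. */
int toy_btree_lookup(const struct toy_btree *tree, uint64_t key, uint64_t *val)
{
	assert(tree && tree->ops && tree->ops->lookup);
	return tree->ops->lookup(tree->leaf, key, val);
}

/* Leaf type 1: explicit (key, value) pairs. */
struct pair_leaf {
	size_t count;
	struct { uint64_t key, val; } e[4];
};

static int pair_lookup(const void *leaf, uint64_t key, uint64_t *val)
{
	const struct pair_leaf *l = leaf;

	for (size_t i = 0; i < l->count; i++) {
		if (l->e[i].key == key) {
			*val = l->e[i].val;
			return 0;
		}
	}
	return -1;
}

/* Leaf type 2: dense slots starting at `base`; a zero slot means "absent". */
struct dense_leaf {
	uint64_t base;
	size_t count;
	uint64_t slot[4];
};

static int dense_lookup(const void *leaf, uint64_t key, uint64_t *val)
{
	const struct dense_leaf *l = leaf;

	if (key < l->base || key - l->base >= l->count || !l->slot[key - l->base])
		return -1;
	*val = l->slot[key - l->base];
	return 0;
}

const struct toy_leaf_ops pair_ops = { .lookup = pair_lookup };
const struct toy_leaf_ops dense_ops = { .lookup = dense_lookup };
```

A caller picks the ops table when it sets up the tree, for example `struct toy_btree t = { &dense_ops, &leaf };`, and from then on uses only toy_btree_lookup.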
The main workhorse of Tux3 is filemap.c, which maps between logical and
physical file extents for read and write. This is analogous to
ext2_get_block but more complex because of extents and btrees. This
spreads out over several subfiles for modularity: filemap_blocklib.c,
filemap_hole.c, filemap_mmap.c.
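As a rough picture of what "mapping between logical and physical file extents" means, here is a self-contained sketch. It assumes a flat, sorted array of extents rather than a btree, and the names toy_extent and toy_map_block are hypothetical; the real filemap.c is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: `count` logical blocks from `logical` map to `physical` onward. */
struct toy_extent {
	uint64_t logical;
	uint64_t physical;
	uint32_t count;
};

/*
 * Map one logical block through a sorted, non-overlapping extent array.
 * Returns 1 and sets *physical on a hit, 0 if the block falls in a hole.
 */
int toy_map_block(const struct toy_extent *ext, size_t n,
		  uint64_t logical, uint64_t *physical)
{
	size_t lo = 0, hi = n;

	assert(ext != NULL || n == 0);
	/* Binary search: afterwards, lo counts the extents starting at or before `logical`. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ext[mid].logical <= logical)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo == 0)
		return 0;                          /* before the first extent */

	const struct toy_extent *e = &ext[lo - 1];
	if (logical - e->logical >= e->count)
		return 0;                          /* hole after this extent */
	*physical = e->physical + (logical - e->logical);
	return 1;
}
```

A read of a block that falls in a hole would return zeroes, while a write would first have to allocate and insert a new extent; neither is modeled here.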
Our delta commit model is implemented in commit.c and its subfiles
commit_flusher.c and commit_flusher_hack.c. This is supported by log.c
and replay.c, to emit log records and replay them on mount. Flushing out
dirty cache is a major Tux3 obsession, implemented in writeback.c and
its subfiles writeback_iattrfork.c, writeback_inodedelete.c and
writeback_xattrfork.c.
We use buffers as handles for cache blocks, and have some unique
requirements there, so we have buffer.c with subfiles buffer_fork.c,
buffer_writeback.c, and buffer_writebacklib.c. These implement our block
fork concept. A "bufvec" batching technique translates buffers to bios
for fast IO.
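One way such batching can work in principle is to coalesce physically contiguous blocks into runs, so that each run can be submitted as a single IO. The sketch below shows only that coalescing step under simplifying assumptions; toy_run and toy_batch_blocks are hypothetical names, and this is not a description of the actual bufvec code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical run of contiguous blocks; one run could go out as one IO. */
struct toy_run {
	uint64_t start;
	uint32_t count;
};

/*
 * Coalesce block numbers (sorted ascending, no duplicates) into runs.
 * `runs` must have room for n entries, the worst case with no neighbors.
 * Returns the number of runs written.
 */
size_t toy_batch_blocks(const uint64_t *blocks, size_t n, struct toy_run *runs)
{
	size_t nruns = 0;

	assert((blocks != NULL && runs != NULL) || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (nruns && runs[nruns - 1].start + runs[nruns - 1].count == blocks[i]) {
			runs[nruns - 1].count++;   /* block extends the current run */
			continue;
		}
		runs[nruns].start = blocks[i];     /* gap: start a new run */
		runs[nruns].count = 1;
		nruns++;
	}
	return nruns;
}
```

For the blocks 10, 11, 12, 20, 30, 31 this yields three runs instead of six single-block IOs.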
Digression: there might be something generically useful in our buffer
code; however, in the long run we would rather replace buffer_head
entirely than try to fix it. Probably, we can save significant CPU and
memory using a framework that specifically provides cache block handles
and not other traditional buffer_head IO functionality. So buffer_head
eradication is in our future work queue and our factoring here reflects
that.
Our scheme for variable sized inodes with optional attributes is
implemented in iattr.c. Block allocation is lightly factored into policy
and mechanism, with the policy bits hived off into policy.c.
The file inode_defer.c is a subfile of inode.c and decouples frontend file
creation code from backend inode table updating. In inode_vfslib.c we
duplicate some core kernel code, which will go away if we can export the
proper core functionality as described earlier. Our ugly hack to export
page_cow_file is in mmap_builtin_hack.c. In utility.c we have a few
functions that could possibly become generic.
We encapsulate some of our internal APIs in header files, so we have
quite a few of those. We also have kcompat.h to support building our
module over a range of kernel versions. This will go away but is not
gone yet. In link.h we have a singly linked list implementation somewhat
resembling the list.h API. We could possibly replace that by llist.h or
something like it. It is less than a hundred lines so it might be wiser
to just leave it.
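For context, a singly linked list in the spirit of list.h can be very small indeed. The following is a hypothetical sketch with invented toy_link names, not the contents of link.h: a circular list with an embedded node, where an empty list is a head that points to itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node embedded in user structures, in the spirit of list.h. */
struct toy_link {
	struct toy_link *next;
};

/* An empty list is a head that points to itself. */
static inline void toy_link_init(struct toy_link *head)
{
	head->next = head;
}

static inline int toy_link_empty(const struct toy_link *head)
{
	return head->next == head;
}

/* Insert `node` right after `prev` in O(1). */
static inline void toy_link_add(struct toy_link *node, struct toy_link *prev)
{
	node->next = prev->next;
	prev->next = node;
}

/* Unlink the node after `prev`; a singly linked list needs the predecessor. */
static inline struct toy_link *toy_link_del_next(struct toy_link *prev)
{
	struct toy_link *node = prev->next;

	assert(node != prev);              /* nothing to delete in an empty list */
	prev->next = node->next;
	node->next = node;
	return node;
}

/* Recover the containing structure from an embedded link. */
#define toy_link_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
```

Compared with list.h, each node costs one pointer instead of two, at the price of needing the predecessor for deletion.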
Regards,
Daniel