Tux3 gets a high speed atom smasher

From: Daniel Phillips
Date: Wed Sep 24 2008 - 22:24:28 EST

Next message: Greg KH: "Re: [PATCH 1/2] VMware guest detection for x86 and x86-64"
Previous message: Jeff Garzik: "Re: [NET-NEXT PATCH] e1000: remove unused Kconfig option for disablingpacket split"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi filesystem aficionados,

The Large Hadron Collider no longer has a monopoly on smashing atoms,
now that Tux3 has weighed in with a bit of its own atom smashing.
Probably the coolest idea that has seen the light of day so far in Tux3
is the concept of xattr atoms: numeric tokens that stand for xattr
names. In contrast to pretty much every other filesystem out there,
Tux3 looks up an xattr in two stages:

1) Look up the attribute name in the atom table
2) Search for the atom in the inode's xattr cache

The sticky issue here is that there is no limit on the number of
different xattr names that are possible, which means that there is no
limit on the size of the atom table. The fact that the number of
distinct names used in practice tends to be very small is,
unfortunately, insufficient justification to ignore the fact that
nothing prevents the atom table from growing without bound. Unless
preventative measures are taken, any unprivileged user could carry out
a denial of service attack simply by creating and deleting random
xattrs on a file until the entire disk volume is filled with unused
xattr names.

Addressing this problem required a lot of soul searching and wild design
banter. Finally we convinced ourselves that the only way to avoid the
alligators in that marsh would be to implement full blown persistent
reference counting for xattr names. A reverse mapping table would also
be necessary in order to be able to implement listxattr efficiently.
Though these things sound scary and complex, it ended up being
implemented in a really small amount of nicely efficient code.

Here is the main design thread:

http://kerneltrap.org/mailarchive/tux3/2008/9/12/3274834/thread
"More xattr design details"

and here we obsess over the tradeoffs at a level of detail I will spare
you from here:

http://tux3.org/pipermail/tux3/2008-September/000143.html
"The long and short of extended attributes"

What is missing from those threads is the end result, alluded to here:

http://tux3.org/pipermail/tux3/2008-September/000186.html
"Atom refcounting redux"

And which I can state more succinctly now that the job is done: the
entire extended attribute support for Tux3 is coded in less than 500
lines, except for two pieces that will land later:

* A hashed cache of xattr names sitting in front of the atom
table.

* Transaction support for atom table, reverse map and refcount
updates.

These two pieces should be about 300 lines at most, or 800 lines total
for xattr support, of which maybe 40% is devoted to resolving and
refcounting xattr names. The effect of the the hash table will be to
tractor beam the xattr lookup performance up from dirop speed
(microseconds) to memhash speed (nanoseconds). Transaction support
will make all the updates including refcounting and reverse mapping
atomic. With the help of Tux3's forward logging accelerator, the
overhead for updating refcounts should be masked nicely by the
unavoidable cost of getting the xattr data on to disk. In the rosiest
scenario, we end up winning by transferring less xattr data to/from
disk due to atoms being much smaller than typical xattr names.

There is also one really esoteric hack going on to reduce the overhead
of logging refcounts: each 32 bit count is divided into two 16 bit
halves, with all the low halves stored on one set of map pages, and the
high halves stored on a different set. Thus, when many xattrs are
being updated at once there will be few carries into the high order
pages and disk bandwidth required to update batches of refcounts will
be reduced by about half. On top of that, we do not update the
refcounts directly, but rather log refcount deltas logically and roll
them up into the disk tables at relatively long intervals. Such log
entries are tucked into a portion of the transaction commit block in
such a way that we get the extra logging more or less for free, by
using otherwise idle commit block space.

I strongly suspect that Tux3 extended attributes will benchmark as well
as existing implementations that just store xattr names directly and
repetitively while requiring significantly less space to do it. If
this proves out in practice (the test is not all that far in the
future) then we are going to be happy about it, and most probably
extend the ideas further. Tux3 atom support is not just about
deduplicating xattr names: it is about deduplication in general. If it
works for xattr names then it will work for bodies as well, and most
likely file data too.

If the implementation of this xattr atom idea had turned out to be a big
rambling mess then I would most likely have just sighed, flushed it,
and gone back to tried and true methods. But it did not turn out to be
a big mess, quite the contrary. This was accomplished by reusing an
existing component: Tux3 atoms are resolved via standard directory
lookup (ext2_find_entry) which serves as a placeholder in Tux3 for a
proper directory index that will arrive later. Instead of storing an
inode number in the directory, we store an atom number.

Another complexity-reduction measure involved mapping three atom-related
tables into the same file:

1) Atom lookup table (Ext2 directory)
2) Atom refcount table
3) Atom reverse map table (for listxattr)

The latter two tables are mapped at block 2^28 in the atom directory
file, which has no particular adverse impact on a btree but may stress
the kernel's page cache radix tree somewhat, which consequently is
forced to be five levels deep as opposed to one or two for ordinary,
non-sparse directory files. I doubt this will show up as a performance
artifact, but if it does there are several ways to fix it, for example:

* Fiddle the page cache radix tree to have a single level of btree
index at the top level for sparse files.

* Keep an array of direct pointers to the atom table when it is
small, which it nearly always will be.

* Store the three tables in three separate inodes, big deal.

Coincidentally, I learned from plumbing the Btrfs repo that Josef Bacik
similarly repurposed a directory mechanism in the direction of xattr
support, but in a distinctly different way: Josef stores xattr bodies
in a variant of the Btrfs directory (if I read it correctly) indexed by
xattr name and inode. On the other hand, Tux3's approach just puts
atom numbers in a single, global directory. Atom numbers happen to
resemble inode numbers so strongly that ext2_find_entry just worked for
this without alteration. We might even use the Ext2 "type" byte field,
intended as an ls accelerate, for some purpose or other. Or maybe not!
After all, the Ext2 directory scheme is just a placeholder in Tux3,
until PHTree lands as described in the original design note:

http://tux3.org/design.html
"PHTree"

It turns out that classic HTree is better for indexing atoms than the
cute new PHTree design because there is no directory traversal and thus
no physical stability requirement. Also there is no requirement for
backward compatibility with Ext2 leaf format, so naturally we will use
a more efficient format with a mini-btree mapped into the leaf block
for that little extra performance edge and, we hope, nearly the same
data density.

Of course, the best thing about the Tux3 atom smashing is that we do not
require low-temperature superconductors, so it should be considerably
more reliable.

Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Greg KH: "Re: [PATCH 1/2] VMware guest detection for x86 and x86-64"
Previous message: Jeff Garzik: "Re: [NET-NEXT PATCH] e1000: remove unused Kconfig option for disablingpacket split"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]