Re: [PATCH v2 15/30] xfs: Define usercopy region in xfs_inode slab cache
From: Christoph Hellwig
Date: Wed Aug 30 2017 - 03:14:13 EST
On Wed, Aug 30, 2017 at 07:51:57AM +1000, Dave Chinner wrote:
> Right, I've looked at btrees, too, but it's more complex than just
> using an rbtree. I originally looked at using Peter Z's old
> RCU-aware btree code, but it doesn't hold data in the tree leaves.
> So that needed significant modification to make work without a
> memory alloc per extent and that didn't work with original aim of
> RCU-safe extent lookups. I also looked at that "generic" btree
> stuff that came from logfs, and after a little while ran away
> screaming.
I started with the latter, but it's not really looking like it any more:
there nodes are formatted as a series of u64s instead of all the
long magic, and the data is stored inline - in fact I use a cute
trick to keep the size down, derived from our "compressed" on disk
extent format:
Key:
+-------+----------------------------+
| 00:51 | all 52 bits of startoff |
| 52:63 | low 12 bits of startblock |
+-------+----------------------------+
Value
+-------+----------------------------+
| 00:20 | all 21 bits of length |
| 21 | unwritten extent bit |
| 22:63 | high 42 bits of startblock |
+-------+----------------------------+
So we only need a 64-bit key and a 64-bit value by abusing parts
of the key to store bits of the startblock.
For non-leaf nodes we iterate through the keys only, never touching
the cache lines for the value. For the leaf nodes we have to touch
the value anyway because we have to do a range lookup to find the
exact record.
This works fine so far in an isolated simulator, and now I'm ammending
it to be a b+tree with pointers to the previous and next node so
that we can nicely implement our extent iterators instead of doing
full lookups.
> The sticking point, IMO, is the extent array index based lookups in
> all the bmbt code. I've been looking at converting all that to use
> offset based lookups and a cursor w/ lookup/inc/dec/insert/delete
> ioperations wrapping xfs_iext_lookup_ext() and friends. This means
> the modifications are pretty much identical to the on-disk extent
> btree, so they can be abstracted out into a single extent update
> interface for both trees. Have you planned/done any cleanup/changes
> with this code?
I've done various cleanups, but I've not yet consolidated the two.
Basically step one at the moment is to move everyone to
xfs_iext_lookup_extent + xfs_iext_get_extent that removes all the
bad intrusion.
Once we move to the actual b+trees the extnum_t cursor will be replaced
with a real cursor structure that contains a pointer to the current
b+tree leaf node, and an index inside that, which will allows us very
efficient iteration. The xfs_iext_get_extent calls will be replaced
with more specific xfs_iext_prev_extent, xfs_iext_next_extent calls
that include the now slightly more complex cursor decrement, increment
as well as a new xfs_iext_last_extent helper for the last extent
that we need in a few places.
insert/delete remain very similar to what they do right now, they'll
get a different cursor type, and the manual xfs_iext_add calls will
go away. The new xfs_iext_update_extent helper I posted to the list
yesterday will become a bit more complex, as changing the startoff
will have to be propagated up the tree.