Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)
From: Matthew Wilcox
Date: Tue Sep 22 2020 - 13:26:12 EST
On Tue, Sep 22, 2020 at 12:46:05PM -0400, Mikulas Patocka wrote:
> I agree that the b+tree was a good choice for XFS.
>
> In RAM-based maps, red-black trees or AVL trees are often used. In
> disk-based maps, btrees or b+trees are used. That's because in RAM you
> are optimizing for the number of cache lines accessed, and on disk you
> are optimizing for the number of blocks accessed.
That's because software hasn't yet caught up to modern hardware. An
in-memory btree outperforms an rbtree for several reasons:
- rbtrees are taller; the average height of a b-tree with even just
8 pointers per level is about half that of an rbtree.
- Not all cachelines are created equal. The cacheline after the one
you're currently accessing is probably already in cache; if not, it's
on its way and will arrive shortly. So a forward scan within this node
is quicker than descending another level of the tree. Our experiments
found a performance improvement with 256-byte B-tree nodes over an
rbtree (a rough node sketch follows below).
- btrees are (or can be) RCU-safe. It's very hard to make an rbtree
RCU-safe. You can use a seqlock to get most of the benefit, but btrees
let you allocate a new node and copy into it, whereas rbtrees embed the
node in the object.
The downside, of course, is that b-trees require external allocations
while rbtrees embed the node in the object. So you may need to do
more allocations. But filesystems love doing additional allocations;
they make journalling so much easier.
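To make the cacheline argument concrete, here's roughly the kind of node
I mean - purely a sketch, not the code of any in-kernel tree, and every
name and size in it (struct bnode, bnode_lookup, 15 keys per 256-byte
node) is made up for illustration:

#include <stddef.h>
#include <stdint.h>

#define BNODE_KEYS	15	/* 8 + 15*8 + 16*8 = 256 bytes per node */

struct bnode {
	uint8_t  nr;			/* keys currently in use */
	uint8_t  leaf;			/* 1: slots[] holds values, 0: children */
	uint8_t  pad[6];
	uint64_t keys[BNODE_KEYS];	/* sorted ascending, scanned forwards */
	void	*slots[BNODE_KEYS + 1];	/* children (internal) or values (leaf) */
};

static void *bnode_lookup(const struct bnode *n, uint64_t key)
{
	while (n) {
		unsigned int i = 0;

		/* Forward scan of consecutive cachelines instead of
		 * chasing pointers down extra levels of the tree. */
		while (i < n->nr && key >= n->keys[i])
			i++;
		if (n->leaf)
			return (i && n->keys[i - 1] == key) ? n->slots[i - 1] : NULL;
		n = n->slots[i];
	}
	return NULL;
}

A 256-byte node is four cachelines, so a lookup within one node touches
adjacent lines the prefetcher is already fetching, and the tree needs far
fewer levels than an rbtree over the same keys.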
> BTW. How does XFS "predict" the file size? - so that it allocates extent
> of proper size without knowing how big the file will be?
XFS does delayed allocation. The page cache absorbs the writes, and then
at writeback time XFS chooses where on storage to put the data. Some
things, like O_SYNC and, er, DAX, break this, but it's very effective.
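In toy form, the idea looks something like this - none of it is XFS
code, and toy_file, toy_buffered_write and find_free are invented names:

#define TOY_BLOCK_SIZE 4096UL

/* Toy per-file delalloc bookkeeping. */
struct toy_file {
	unsigned long delalloc_blocks;	/* reserved, not yet placed on disk */
	unsigned long extent_start;	/* chosen only at writeback time */
	unsigned long extent_len;
};

/* write() path: dirty the page cache and account for space, nothing more. */
static void toy_buffered_write(struct toy_file *f, unsigned long len)
{
	f->delalloc_blocks += (len + TOY_BLOCK_SIZE - 1) / TOY_BLOCK_SIZE;
}

/* writeback path: the size of the dirty range is now known, so one
 * contiguous extent can be chosen for the whole thing. */
static void toy_writeback(struct toy_file *f,
			  unsigned long (*find_free)(unsigned long nblocks))
{
	f->extent_start = find_free(f->delalloc_blocks);
	f->extent_len = f->delalloc_blocks;
	f->delalloc_blocks = 0;
}

Because the allocator runs once over the whole dirty range instead of
once per write(), it can hand back one large extent and the file stays
unfragmented.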
> > The NVFS indirect block tree has a fan-out of 16,
>
> No. The top level in the inode contains 16 blocks (11 direct and 5
> indirect). And each indirect block can have 512 pointers (4096/8). You can
> format the device with larger block size and this increases the fanout
> (the NVFS block size must be greater than or equal to the system page size).
>
> 2 levels can map 1GiB (4096*512^2), 3 levels can map 512 GiB, 4 levels can
> map 256 TiB and 5 levels can map 128 PiB.
But compare that to an unfragmented file ... you can map the entire thing
with a single entry. Even if you have to use a leaf node, you can get
four extents into a single cacheline (and that's with a fairly naive leaf
node layout; I don't know exactly what XFS uses).
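For concreteness, a naive leaf record of the kind I mean (purely
illustrative - not the actual XFS on-disk format):

#include <stdint.h>

/* 16 bytes per extent, so four extents per 64-byte cacheline. */
struct toy_extent {
	uint64_t file_offset;	/* logical offset, in filesystem blocks */
	uint32_t disk_block;	/* physical start block */
	uint32_t nblocks;	/* length of the extent, in blocks */
};

/* A 256-byte leaf node holds 16 such extents (four cachelines). */
struct toy_extent_leaf {
	struct toy_extent ext[16];
};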
> > Rename is another operation that has specific "operation has atomic
> > behaviour" expectations. I haven't looked at how you've
> > implementated that yet, but I suspect it also is extremely difficult
> > to implement in an atomic manner using direct pmem updates to the
> > directory structures.
>
> There is a small window when the renamed inode is in neither the source
> nor the target directory. Fsck will reclaim such an inode and add it to
> lost+found - just like on EXT2.
... ouch. If you have to choose, it'd be better to link it into the second
directory and then unlink it from the first. Then your fsck can detect
that the link count is wrong and fix up the count (ie the inode ends up
linked into both directories rather than into neither).
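Sketched out (toy in-memory structures, and persist() standing in for the
cacheline writeback + fence you'd do on real pmem - this is not the NVFS
code):

#include <stddef.h>

struct toy_dirent { unsigned long ino; char name[32]; };
struct toy_dir	  { struct toy_dirent slot[64]; };

/* Stand-in for flushing a range to persistent memory. */
static void persist(const void *addr, size_t len) { (void)addr; (void)len; }

/* Link into the target directory first, then unlink from the source. */
static void toy_rename(struct toy_dir *src, int si,
		       struct toy_dir *dst, int di)
{
	dst->slot[di] = src->slot[si];
	persist(&dst->slot[di], sizeof(dst->slot[di]));

	/* A crash here leaves the inode in both directories with a link
	 * count that is too high - something fsck can detect and repair,
	 * rather than an orphan that has to go to lost+found. */

	src->slot[si].ino = 0;
	persist(&src->slot[si].ino, sizeof(src->slot[si].ino));
}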
> If you think that the lack of journaling is a show-stopper, I can implement
> it. But then I'll have something with the complexity of EXT4 and the
> performance of EXT4, so there will no longer be any reason to use NVFS
> over EXT4. Without journaling, it will be faster than EXT4 and it may
> attract some users who want good performance and who don't care about GID
> and UID being updated atomically, etc.
Well, what's your intent with nvfs? Do you already have customers in mind
who want to use this in production, or is this somewhere to play with and
develop concepts that might make it into one of the longer-established
filesystems?