bcachefs status update - current and future work

From: Kent Overstreet
Date: Thu Nov 04 2021 - 13:36:00 EST


Time to try and summarize everything that's been going on in bcachefs land since
my last lkml posting, and my thoughts on what's next.

Core btree improvements:
- Updates to interior btree nodes are now journalled.

- We're now updating parent btree node pointers on every btree write. This was
a pretty major improvement - it means we can now always detect lost btree
writes, which was a hole in encrypted mode and also turned out to be a
robustness issue in RAID mode. It also means we can start to drop the journal
sequence number blacklist mechanism, and it closed some rare corner case issues.
And thanks to the previous item, it didn't cost us any performance.

- We no longer have to mark every journal write as flush/fua - stole this idea
from XFS, it was a pretty nice performance improvement.

- Lots of btree locking improvements: notably, we now have assertions that we
never hold btree locks while doing IO. This is really good for tail latency.

- The transaction model is steadily improving and gaining more and more
assertions; this makes it easier to write upper level FS code without
worrying about locking considerations. We've started requiring every btree
transaction to start with bch2_trans_begin(), and in particular there are
asserts that this is the next thing called after a transaction restart.
Catching random little bugs with new assertions is a good feeling.

- The btree iterator code has now been split up into btree_iter and btree_path;
btree_path implements the "path to a particular position in the btree" code,
and btree_iter sits on top of that and implements iteration over keys,
iteration over slots, iteration over extents, iteration for snapshots (that's
a whole thing), and more - this refactoring came about during the work for
snapshots and it turned out really nicely.
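
To illustrate the parent-pointer item above, here is a rough sketch in Python
(not bcachefs code; ChildNode, ParentPointer, write_node and read_node are
made-up names, and the "count of writes" field is an assumption standing in for
whatever the real pointer records). The parent pointer is bumped on every write
to the child, so a read can tell when the device dropped one:

```python
from dataclasses import dataclass, field


@dataclass
class ChildNode:
    """Toy on-disk image of a btree node: a log of appended writes."""
    writes: list = field(default_factory=list)


@dataclass
class ParentPointer:
    """Toy parent key: records how many writes the child should contain."""
    writes_expected: int = 0


def write_node(disk, ptr, payload, lose_write=False):
    # lose_write simulates a device that acknowledged the write but
    # never persisted it.
    if not lose_write:
        disk.writes.append(payload)
    # The parent pointer is updated on every write, lost or not.
    ptr.writes_expected += 1


def read_node(disk, ptr):
    # If the node holds fewer writes than the parent says it should,
    # one of them was lost.
    if len(disk.writes) < ptr.writes_expected:
        raise IOError("lost btree write detected")
    return list(disk.writes)
```

Calling write_node() with lose_write=True and then read_node() raises, where a
node that only described itself would have read back as valid but stale.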

Recovery:
- All alloc info is now updated fully transactionally. Originally we'd have to
regenerate alloc info on every mount, then after every unclean shutdown -
then for a long time we only had to regenerate alloc info for metadata after
unclean shutdown. With updates to interior btree nodes being fully
journalled, that makes updates to alloc info fully transactional and our
mount times fast.

Currently we still have to read all alloc info into memory on mount, but that
too will be changing.
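
A toy model of why this makes mounts fast, in Python (not bcachefs code;
commit, replay and regenerate_alloc are hypothetical names, and the journal
entry layout is invented for illustration). When the extent update and the
alloc update land in the same atomic journal entry, replay alone gives
consistent alloc info and nothing has to be rescanned:

```python
def commit(journal, extent_key, bucket, sectors):
    """Append one atomic journal entry carrying both updates."""
    journal.append((
        ("extents", extent_key, (bucket, sectors)),
        ("alloc", bucket, sectors),
    ))


def replay(journal):
    """Rebuild both btrees from the journal after an unclean shutdown."""
    extents, alloc = {}, {}
    for entry in journal:
        for btree, key, value in entry:
            if btree == "extents":
                extents[key] = value
            else:
                alloc[key] = alloc.get(key, 0) + value
    return extents, alloc


def regenerate_alloc(extents):
    """The old, slow way: walk every extent to recompute bucket usage."""
    alloc = {}
    for bucket, sectors in extents.values():
        alloc[bucket] = alloc.get(bucket, 0) + sectors
    return alloc
```

For simplicity the model assumes each extent key is only ever inserted once;
with that, replay() and regenerate_alloc() agree, but only the latter has to
touch every extent.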

Features:
- Reflink: I believe all the bugs have finally been shaken out. The last bug to
be found was a refcount leak when we fragmented an existing indirect extent
(by copygc/rebalance), and a reflink pointer only pointed to part of it.

- Erasure coding - we're still popping some silly assertions, it's on my todo
list.

- Encryption: people keep wanting AES support, so at some point I'll try and
find the time to add AES/GCM.

- SNAPSHOTS ARE DONE (mostly), and they're badass.

I've successfully gotten up to a million snapshots (only changing a single
file in each snapshot) in a VM. They scale. Fsck scales. Take as many
snapshots as you want. Go wild.

Still todo:
- need to export a different st_dev for each subvolume, like btrfs, so that
find -xdev does what you want and skips snapshots

- we would like better atomicity w.r.t. pagecache on snapshot creation, and
it'd be nice if we didn't have to do a big sync when creating a snapshot -
we could do this by getting the subvolume's current snapshot ID at buffered
write time, but there's other things that make this hard

- we need per-snapshot ID disk space accounting. This is going to have to
wait for a giant disk space accounting rework though, which will move disk
space accounting out of the journal and to a dedicated btree.

- userspace interface is very minimal - e.g. still need to implement
recursive snapshotting.

- quota support is currently disabled, because of interactions with
snapshots; re-enabling that is high on my todo list.

- the btree key cache is currently disabled for inodes, also because of
interactions with snapshots: this is a performance regression until we get
this solved.
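
For readers wondering how snapshots can scale like that, here is a rough
sketch in Python. It assumes a design where every key carries a snapshot ID,
snapshot IDs form a tree, and a lookup sees the version from its nearest
ancestor. That is an assumption made for illustration, not a description of
the actual on-disk format, and SnapshotTree and lookup are made-up names:

```python
class SnapshotTree:
    """Toy tree of snapshot IDs; ID 1 is the root."""

    def __init__(self):
        self.parent = {1: None}  # snapshot ID -> parent ID
        self.next_id = 2

    def create_child(self, parent_id):
        # Creating a snapshot ID only adds a tree node; no keys are copied.
        new_id = self.next_id
        self.next_id += 1
        self.parent[new_id] = parent_id
        return new_id

    def ancestors(self, snap_id):
        # Yields snap_id itself first, then each ancestor up to the root.
        while snap_id is not None:
            yield snap_id
            snap_id = self.parent[snap_id]


def lookup(keys, tree, pos, snap_id):
    """Return the value at pos as seen from snap_id, or None."""
    for s in tree.ancestors(snap_id):
        if (pos, s) in keys:
            return keys[(pos, s)]
    return None
```

In this model, changing a single file in a snapshot adds one key tagged with
that snapshot's ID; everything else stays shared with the ancestors.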

About a year of my life went into snapshots and I'm _really_ proud of how they
turned out - in terms of algorithmic complexity, snapshots have been the biggest
single feature tackled and when I started there were a lot of big unknowns that
I honestly wasn't sure I was going to find solutions for. Still waiting on
more people to start really testing with them and banging on them (and we do
still need more tests written) but so far shaking things out has gone really
smoothly (more smoothly than erasure coding, that's for sure!)

FUTURE WORK:

I'm going to start really getting on people for review and working on
upstreaming this beast. I intend for it to be marked EXPERIMENTAL for a while,
naturally - there are still on disk format changes coming that will be forced
upgrades. But getting snapshots done was the big goal I'd set for myself, so
it's time.

Besides that, my next big focus is going to be on scalability. bcachefs was
hitting 50 TB volumes even before it was called bcachefs - I fully intend for it
to scale to 50 PB. To get there, we need to:

- Get rid of the in-memory bucket array. We're mostly there, all allocation
information lives in the btree, but we need to make more improvements to the
btree representation before we can ditch the pure in memory representation.

- We need new persistent data structures for the allocator, so that the
allocator doesn't have to scan buckets. First up will be implementing a
persistent LRU, then probably a free space btree.

- We need a backpointers btree, so that copygc doesn't have to scan the
extents/reflink btrees.

- Online fsck. This will come in stages: first, there's the filesystem level
fsck code in fs/bcachefs/fsck.c. The recent work improving the btree
transaction layer and adding assertions there has been forcing the fsck code
to change to be more rigorously correct (in the context of running
concurrently with other filesystem operations); a lot of that code is most of
the way there now. We'll need additional locking vs. other filesystem code
for the directory structure and inode nlinks passes, but shouldn't for the
rest of the passes.

After fsck.c is running concurrently, it'll be time to bring back concurrent
btree gc, which regenerates alloc info. Woohoo.
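
To make the backpointers item above concrete, here is a sketch in Python of
the general idea: a reverse index from bucket to the extents living in it. The
design is not settled in this post, so the Fs class and all of its methods are
hypothetical:

```python
from collections import defaultdict


class Fs:
    """Toy filesystem: extents map key -> bucket, plus a reverse index."""

    def __init__(self):
        self.extents = {}
        self.backpointers = defaultdict(set)  # bucket -> extent keys

    def insert_extent(self, key, bucket):
        # Overwriting an extent drops its old backpointer first.
        self.delete_extent(key)
        self.extents[key] = bucket
        self.backpointers[bucket].add(key)

    def delete_extent(self, key):
        bucket = self.extents.pop(key, None)
        if bucket is not None:
            self.backpointers[bucket].discard(key)

    def extents_in_bucket_scan(self, bucket):
        # What copygc has to do without backpointers: walk every extent.
        return {k for k, b in self.extents.items() if b == bucket}

    def extents_in_bucket(self, bucket):
        # With backpointers: a direct lookup, independent of fs size.
        return set(self.backpointers.get(bucket, ()))
```

Both lookups return the same set; the difference is that the scan costs time
proportional to the total number of extents, which is what stops working at
50 PB.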


-------------
End brain dump, thank you kindly for reading :)