Re: [PATCH 04/21] bcachefs: Disk space accounting rewrite

From: Kent Overstreet
Date: Wed Feb 28 2024 - 23:10:36 EST


On Tue, Feb 27, 2024 at 10:55:02AM -0500, Brian Foster wrote:
> On Sat, Feb 24, 2024 at 09:38:06PM -0500, Kent Overstreet wrote:
> > Main part of the disk accounting rewrite.
> >
> > This is a wholesale rewrite of the existing disk space accounting, which
> > relies on percepu counters that are sharded by journal buffer, and
> > rolled up and added to each journal write.
> >
> > With the new scheme, every set of counters is a distinct key in the
> > accounting btree; this fixes scaling limitations of the old scheme,
> > where counters took up space in each journal entry and required multiple
> > percpu counters.
> >
> > Now, in memory accounting requires a single set of percpu counters - not
> > multiple for each in flight journal buffer - and in the future we'll
> > probably also have counters that don't use in memory percpu counters,
> > they're not strictly required.
> >
> > An accounting update is now a normal btree update, using the btree write
> > buffer path. At transaction commit time, we apply accounting updates to
> > the in memory counters, which are percpu counters indexed in an
> > eytzinger tree by the accounting key.
> >
> > Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> > ---
> > fs/bcachefs/alloc_background.c | 68 +++++-
> > fs/bcachefs/bcachefs.h | 6 +-
> > fs/bcachefs/bcachefs_format.h | 1 -
> > fs/bcachefs/bcachefs_ioctl.h | 7 +-
> > fs/bcachefs/btree_gc.c | 3 +-
> > fs/bcachefs/btree_iter.c | 9 -
> > fs/bcachefs/btree_trans_commit.c | 62 ++++--
> > fs/bcachefs/btree_types.h | 1 -
> > fs/bcachefs/btree_update.h | 8 -
> > fs/bcachefs/buckets.c | 289 +++++---------------------
> > fs/bcachefs/buckets.h | 33 +--
> > fs/bcachefs/disk_accounting.c | 308 ++++++++++++++++++++++++++++
> > fs/bcachefs/disk_accounting.h | 126 ++++++++++++
> > fs/bcachefs/disk_accounting_types.h | 20 ++
> > fs/bcachefs/ec.c | 24 ++-
> > fs/bcachefs/inode.c | 9 +-
> > fs/bcachefs/recovery.c | 12 +-
> > fs/bcachefs/recovery_types.h | 1 +
> > fs/bcachefs/replicas.c | 42 ++--
> > fs/bcachefs/replicas.h | 11 +-
> > fs/bcachefs/replicas_types.h | 16 --
> > fs/bcachefs/sb-errors_types.h | 3 +-
> > fs/bcachefs/super.c | 49 +++--
> > 23 files changed, 704 insertions(+), 404 deletions(-)
> > create mode 100644 fs/bcachefs/disk_accounting_types.h
> >
> ...
> > diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
> > index 209f59e87b34..327c586ac661 100644
> > --- a/fs/bcachefs/disk_accounting.c
> > +++ b/fs/bcachefs/disk_accounting.c
> ...
> > @@ -13,6 +17,44 @@ static const char * const disk_accounting_type_strs[] = {
> > NULL
> > };
> >
>
> So I'm gonna need to stare at all this much more than I have so far, but
> one initial thing that stands out to me is the lack of high level
> function comments. IMO, something that helps tremendously in
> reading/reviewing these sorts of systemic changes is having a a couple
> sentence or so comment at the top of the main/external interfaces just
> to briefly explain what they do in plain english.
>
> So here, something like "modify an accounting key in the btree based on
> <whatever> ..." helps explain what it does and why it's used where it
> is. The same goes for some of the other interface level functions, like
> reading in accounting from disk, updating in-memory accounting (from
> journal entries in committing transactions?), updating the superblock,
> etc. I think I've started to put some of those pieces together, but
> having to jump all through the implementation to piece together high
> level behaviors is significantly more time consuming than having the
> author guide one through the high level interactions.
>
> IOW, I think if you minimally document the functions that are sufficient
> to help understand how accounting works as a black box (somewhere
> beneath the [nice] higher level big comment descriptions of the whole
> thing and above the low level implementation details), that helps the
> reviewer establish an understanding of the mechanism before having to
> dig through the implementation details and also serves as a reference
> going forward for the next person who is in a similar position and wants
> to read/debug/tweak/whatever this code.
>
> > +int bch2_disk_accounting_mod(struct btree_trans *trans,
> > + struct disk_accounting_key *k,
> > + s64 *d, unsigned nr)
> > +{
> > + /* Normalize: */
> > + switch (k->type) {
> > + case BCH_DISK_ACCOUNTING_replicas:
> > + bubble_sort(k->replicas.devs, k->replicas.nr_devs, u8_cmp);
> > + break;
> > + }
> > +
> > + BUG_ON(nr > BCH_ACCOUNTING_MAX_COUNTERS);
> > +
> > + struct {
> > + __BKEY_PADDED(k, BCH_ACCOUNTING_MAX_COUNTERS);
> > + } k_i;
> > + struct bkey_i_accounting *acc = bkey_accounting_init(&k_i.k);
> > +
> > + acc->k.p = disk_accounting_key_to_bpos(k);
> > + set_bkey_val_u64s(&acc->k, sizeof(struct bch_accounting) / sizeof(u64) + nr);
> > +
> > + memcpy_u64s_small(acc->v.d, d, nr);
> > +
> > + return bch2_trans_update_buffered(trans, BTREE_ID_accounting, &acc->k_i);
> > +}
> > +
> ...
> > diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
> > index a7f9de220d90..685d54d0ddbb 100644
> > --- a/fs/bcachefs/super.c
> > +++ b/fs/bcachefs/super.c
> ...
> > @@ -1618,6 +1621,16 @@ int bch2_dev_remove(struct bch_fs *c, struct bch_dev *ca, int flags)
> > if (ret)
> > goto err;
> >
> > + /*
> > + * We need to flush the entire journal to get rid of keys that reference
> > + * the device being removed before removing the superblock entry
> > + */
> > + bch2_journal_flush_all_pins(&c->journal);
>
> I thought this needed to occur between the device removal and superblock
> update (according to the comment below). Is that not the case? Either
> way, is it moved for reasons related to accounting?

I think it ended up not needing to be moved, and I just forgot to drop
it - originally I disallowed accounting entries that referenced
nonexistent devices, but that wasn't workable so now it's only nonzero
accounting keys that aren't allowed to reference nonexistent devices.

I'll see if I can delete it.

Applying the following fixup patch, renaming for consistency but mostly
adding documentation. Helpful?