Re: [PATCH 04/21] bcachefs: Disk space accounting rewrite

From: Brian Foster
Date: Tue Feb 27 2024 - 10:53:28 EST


On Sat, Feb 24, 2024 at 09:38:06PM -0500, Kent Overstreet wrote:
> Main part of the disk accounting rewrite.
>
> This is a wholesale rewrite of the existing disk space accounting, which
> relies on percepu counters that are sharded by journal buffer, and
> rolled up and added to each journal write.
>
> With the new scheme, every set of counters is a distinct key in the
> accounting btree; this fixes scaling limitations of the old scheme,
> where counters took up space in each journal entry and required multiple
> percpu counters.
>
> Now, in memory accounting requires a single set of percpu counters - not
> multiple for each in flight journal buffer - and in the future we'll
> probably also have counters that don't use in memory percpu counters,
> they're not strictly required.
>
> An accounting update is now a normal btree update, using the btree write
> buffer path. At transaction commit time, we apply accounting updates to
> the in memory counters, which are percpu counters indexed in an
> eytzinger tree by the accounting key.
>
> Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> ---
> fs/bcachefs/alloc_background.c | 68 +++++-
> fs/bcachefs/bcachefs.h | 6 +-
> fs/bcachefs/bcachefs_format.h | 1 -
> fs/bcachefs/bcachefs_ioctl.h | 7 +-
> fs/bcachefs/btree_gc.c | 3 +-
> fs/bcachefs/btree_iter.c | 9 -
> fs/bcachefs/btree_trans_commit.c | 62 ++++--
> fs/bcachefs/btree_types.h | 1 -
> fs/bcachefs/btree_update.h | 8 -
> fs/bcachefs/buckets.c | 289 +++++---------------------
> fs/bcachefs/buckets.h | 33 +--
> fs/bcachefs/disk_accounting.c | 308 ++++++++++++++++++++++++++++
> fs/bcachefs/disk_accounting.h | 126 ++++++++++++
> fs/bcachefs/disk_accounting_types.h | 20 ++
> fs/bcachefs/ec.c | 24 ++-
> fs/bcachefs/inode.c | 9 +-
> fs/bcachefs/recovery.c | 12 +-
> fs/bcachefs/recovery_types.h | 1 +
> fs/bcachefs/replicas.c | 42 ++--
> fs/bcachefs/replicas.h | 11 +-
> fs/bcachefs/replicas_types.h | 16 --
> fs/bcachefs/sb-errors_types.h | 3 +-
> fs/bcachefs/super.c | 49 +++--
> 23 files changed, 704 insertions(+), 404 deletions(-)
> create mode 100644 fs/bcachefs/disk_accounting_types.h
>
..
> diff --git a/fs/bcachefs/disk_accounting.c b/fs/bcachefs/disk_accounting.c
> index 209f59e87b34..327c586ac661 100644
> --- a/fs/bcachefs/disk_accounting.c
> +++ b/fs/bcachefs/disk_accounting.c
..
> @@ -13,6 +17,44 @@ static const char * const disk_accounting_type_strs[] = {
> NULL
> };
>

So I'm gonna need to stare at all this much more than I have so far, but
one initial thing that stands out to me is the lack of high level
function comments. IMO, something that helps tremendously in
reading/reviewing these sorts of systemic changes is having a a couple
sentence or so comment at the top of the main/external interfaces just
to briefly explain what they do in plain english.

So here, something like "modify an accounting key in the btree based on
<whatever> ..." helps explain what it does and why it's used where it
is. The same goes for some of the other interface level functions, like
reading in accounting from disk, updating in-memory accounting (from
journal entries in committing transactions?), updating the superblock,
etc. I think I've started to put some of those pieces together, but
having to jump all through the implementation to piece together high
level behaviors is significantly more time consuming than having the
author guide one through the high level interactions.

IOW, I think if you minimally document the functions that are sufficient
to help understand how accounting works as a black box (somewhere
beneath the [nice] higher level big comment descriptions of the whole
thing and above the low level implementation details), that helps the
reviewer establish an understanding of the mechanism before having to
dig through the implementation details and also serves as a reference
going forward for the next person who is in a similar position and wants
to read/debug/tweak/whatever this code.

> +int bch2_disk_accounting_mod(struct btree_trans *trans,
> + struct disk_accounting_key *k,
> + s64 *d, unsigned nr)
> +{
> + /* Normalize: */
> + switch (k->type) {
> + case BCH_DISK_ACCOUNTING_replicas:
> + bubble_sort(k->replicas.devs, k->replicas.nr_devs, u8_cmp);
> + break;
> + }
> +
> + BUG_ON(nr > BCH_ACCOUNTING_MAX_COUNTERS);
> +
> + struct {
> + __BKEY_PADDED(k, BCH_ACCOUNTING_MAX_COUNTERS);
> + } k_i;
> + struct bkey_i_accounting *acc = bkey_accounting_init(&k_i.k);
> +
> + acc->k.p = disk_accounting_key_to_bpos(k);
> + set_bkey_val_u64s(&acc->k, sizeof(struct bch_accounting) / sizeof(u64) + nr);
> +
> + memcpy_u64s_small(acc->v.d, d, nr);
> +
> + return bch2_trans_update_buffered(trans, BTREE_ID_accounting, &acc->k_i);
> +}
> +
..
> diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
> index a7f9de220d90..685d54d0ddbb 100644
> --- a/fs/bcachefs/super.c
> +++ b/fs/bcachefs/super.c
..
> @@ -1618,6 +1621,16 @@ int bch2_dev_remove(struct bch_fs *c, struct bch_dev *ca, int flags)
> if (ret)
> goto err;
>
> + /*
> + * We need to flush the entire journal to get rid of keys that reference
> + * the device being removed before removing the superblock entry
> + */
> + bch2_journal_flush_all_pins(&c->journal);

I thought this needed to occur between the device removal and superblock
update (according to the comment below). Is that not the case? Either
way, is it moved for reasons related to accounting?

Brian

> +
> + /*
> + * this is really just needed for the bch2_replicas_gc_(start|end)
> + * calls, and could be cleaned up:
> + */
> ret = bch2_journal_flush_device_pins(&c->journal, ca->dev_idx);
> bch_err_msg(ca, ret, "bch2_journal_flush_device_pins()");
> if (ret)
> @@ -1655,17 +1668,6 @@ int bch2_dev_remove(struct bch_fs *c, struct bch_dev *ca, int flags)
>
> bch2_dev_free(ca);
>
> - /*
> - * At this point the device object has been removed in-core, but the
> - * on-disk journal might still refer to the device index via sb device
> - * usage entries. Recovery fails if it sees usage information for an
> - * invalid device. Flush journal pins to push the back of the journal
> - * past now invalid device index references before we update the
> - * superblock, but after the device object has been removed so any
> - * further journal writes elide usage info for the device.
> - */
> - bch2_journal_flush_all_pins(&c->journal);
> -
> /*
> * Free this device's slot in the bch_member array - all pointers to
> * this device must be gone:
> @@ -1727,8 +1729,6 @@ int bch2_dev_add(struct bch_fs *c, const char *path)
> goto err;
> }
>
> - bch2_dev_usage_init(ca);
> -
> ret = __bch2_dev_attach_bdev(ca, &sb);
> if (ret)
> goto err;
> @@ -1793,6 +1793,10 @@ int bch2_dev_add(struct bch_fs *c, const char *path)
>
> bch2_dev_usage_journal_reserve(c);
>
> + ret = bch2_dev_usage_init(ca);
> + if (ret)
> + goto err_late;
> +
> ret = bch2_trans_mark_dev_sb(c, ca);
> bch_err_msg(ca, ret, "marking new superblock");
> if (ret)
> @@ -1956,15 +1960,18 @@ int bch2_dev_resize(struct bch_fs *c, struct bch_dev *ca, u64 nbuckets)
> mutex_unlock(&c->sb_lock);
>
> if (ca->mi.freespace_initialized) {
> - ret = bch2_dev_freespace_init(c, ca, old_nbuckets, nbuckets);
> + struct disk_accounting_key acc = {
> + .type = BCH_DISK_ACCOUNTING_dev_data_type,
> + .dev_data_type.dev = ca->dev_idx,
> + .dev_data_type.data_type = BCH_DATA_free,
> + };
> + u64 v[3] = { nbuckets - old_nbuckets, 0, 0 };
> +
> + ret = bch2_dev_freespace_init(c, ca, old_nbuckets, nbuckets) ?:
> + bch2_trans_do(ca->fs, NULL, NULL, 0,
> + bch2_disk_accounting_mod(trans, &acc, v, ARRAY_SIZE(v)));
> if (ret)
> goto err;
> -
> - /*
> - * XXX: this is all wrong transactionally - we'll be able to do
> - * this correctly after the disk space accounting rewrite
> - */
> - ca->usage_base->d[BCH_DATA_free].buckets += nbuckets - old_nbuckets;
> }
>
> bch2_recalc_capacity(c);
> --
> 2.43.0
>