Re: [PATCH 03/21] bcachefs: btree write buffer knows how to accumulate bch_accounting keys

From: Brian Foster
Date: Thu Feb 29 2024 - 13:42:39 EST


On Wed, Feb 28, 2024 at 05:42:39PM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 10:50:23AM -0500, Brian Foster wrote:
> > On Sat, Feb 24, 2024 at 09:38:05PM -0500, Kent Overstreet wrote:
> > > +	if (!*accounting_accumulated && wb->k.k.type == KEY_TYPE_accounting) {
> > > +		struct bkey u;
> > > +		struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, iter), &u);
> > > +
> > > +		if (k.k->type == KEY_TYPE_accounting)
> > > +			bch2_accounting_accumulate(bkey_i_to_accounting(&wb->k),
> > > +						   bkey_s_c_to_accounting(k));
> >
> > So it looks like we're accumulating from the btree key into the write
> > buffer key. Is this so the following code will basically insert a new
> > btree key based on the value of the write buffer key?
>
> Correct, this is where we go from "the accounting key is a delta" to
> "the accounting key is the new version of the key".
>
> > > 	darray_for_each(wb->sorted, i) {
> > > 		struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
> > > +		bool accounting_accumulated = false;
> >
> > Should this live within the interior flush loop?
>
> We can't define it within the loop because then we'd be setting it to
> false on every loop iteration... but it does belong _with_ the loop, so
> I'll move it to right before.
>

Ah, right.
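
And just to restate the delta model above for my own notes: IIUC the
accumulate step is conceptually just element-wise addition of the
counters, i.e. something like this toy sketch (not the actual bcachefs
types or helpers):

	struct acct {
		unsigned	nr;	/* number of counters in the key */
		long long	d[4];	/* counter values (or deltas) */
	};

	/* fold a delta key (src) into the current key (dst) */
	static void acct_accumulate(struct acct *dst, const struct acct *src)
	{
		for (unsigned i = 0; i < dst->nr && i < src->nr; i++)
			dst->d[i] += src->d[i];
	}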

> > > -		bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > > -					bch2_btree_write_buffer_journal_flush);
> > > +		if (!accounting_replay_done &&
> > > +		    i->k.k.type == KEY_TYPE_accounting) {
> > > +			could_not_insert++;
> > > +			continue;
> > > +		}
> > > +
> > > +		if (!could_not_insert)
> > > +			bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > > +						bch2_btree_write_buffer_journal_flush);
> >
> > Hmm.. so this is sane because the slowpath runs in journal sorted order,
> > right?
>
> yup, which means as soon as we hit a key we can't insert, we can't
> release any more journal pins
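
Makes sense. So the invariant is roughly the following toy model (not
the actual code; seqs[] and can_insert[] stand in for the flush list):
once one key is skipped, the pin freezes even though later keys may
still be inserted:

	/* return the newest seq the journal pin may advance to */
	static unsigned long long pin_target(const unsigned long long *seqs,
					     const int *can_insert,
					     unsigned nr,
					     unsigned long long pin)
	{
		int hit_skipped = 0;

		for (unsigned i = 0; i < nr; i++) {
			if (!can_insert[i]) {
				hit_skipped = 1;	/* this key stays behind... */
				continue;
			}
			if (!hit_skipped)
				pin = seqs[i];	/* ...the pin can't move past it */
		}
		return pin;
	}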
>
> >
> > >
> > > 		bch2_trans_begin(trans);
> > >
> > > @@ -375,13 +409,27 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
> > > 					btree_write_buffered_insert(trans, i));
> > > 		if (ret)
> > > 			goto err;
> > > +
> > > +		i->journal_seq = 0;
> > > +	}
> > > +
> > > +
> >
> > /*
> >  * Condense the remaining keys <reasons reasons>...??
> >  */
>
> yup, that's a good comment
>
> > > +	if (could_not_insert) {
> > > +		struct btree_write_buffered_key *dst = wb->flushing.keys.data;
> > > +
> > > +		darray_for_each(wb->flushing.keys, i)
> > > +			if (i->journal_seq)
> > > +				*dst++ = *i;
> > > +		wb->flushing.keys.nr = dst - wb->flushing.keys.data;
> > > 	}
> > > }
> > > err:
> > > +	if (ret || !could_not_insert) {
> > > +		bch2_journal_pin_drop(j, &wb->flushing.pin);
> > > +		wb->flushing.keys.nr = 0;
> > > +	}
> > > +
> > > 	bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret));
> > > -	trace_write_buffer_flush(trans, wb->flushing.keys.nr, skipped, fast, 0);
> > > -	bch2_journal_pin_drop(j, &wb->flushing.pin);
> > > -	wb->flushing.keys.nr = 0;
> > > +	trace_write_buffer_flush(trans, wb->flushing.keys.nr, overwritten, fast, 0);
> >
> > I feel like the last time I looked at the write buffer stuff the flush
> > wasn't reentrant in this way. I.e., the flush switched out the active
> > buffer and so had to process all entries in the current buffer (or
> > something like that). Has something changed or do I misunderstand?
>
> Yeah, originally we were adding keys to the write buffer directly from
> the transaction commit path, so that necessitated the super fast
> lockless stuff where we'd toggle between buffers so one was always
> available.
>
> Now keys are pulled from the journal, so we can use (somewhat) simpler
> locking and buffering; the complication now is that we can't predict in
> advance how many keys are going to come out of the journal for the
> write buffer.
>

Ok.

> >
> > > 	return ret;
> > > }
> > >
> > > diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
> > > index 6829d80bd181..b8289af66c8e 100644
> > > --- a/fs/bcachefs/recovery.c
> > > +++ b/fs/bcachefs/recovery.c
> > > @@ -228,6 +228,8 @@ static int bch2_journal_replay(struct bch_fs *c)
> > > 		goto err;
> > > 	}
> > >
> > > +	set_bit(BCH_FS_accounting_replay_done, &c->flags);
> > > +
> >
> > I assume this ties into the question on the previous patch..
> >
> > Related question.. if the write buffer can't flush during journal
> > replay, is there concern/risk of overflowing it?
>
> Shouldn't be any actual risk. It's just new accounting updates that the
> write buffer can't flush, and those are only going to be generated by
> interior btree node updates as journal replay has to split/rewrite nodes
> to make room for its updates.
>
> And for those new accounting updates, updates to the same counters get
> accumulated as they're flushed from the journal to the write buffer -
> see the eytzinger tree accumulation patch. So we could only overflow
> if the number of distinct counters touched was somehow very large.
>
> And the number of distinct counters will be growing significantly, but
> the new counters will all be for user data, not metadata.
>
> (Except: that reminds me, we do want to add per-btree counters, so users
> can see "I have x amount of extents, x amount of dirents, etc.")
>

Heh, Ok. This all does sound a little open-ended to me. Maybe the better
question is: suppose this hypothetically does happen after adding a
bunch of new counters - what would the expected side effect be in the
recovery scenario where the write buffer can't be flushed?
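
For reference, my mental model of the accumulation you describe is
roughly the toy sketch below (a linear scan standing in for the
eytzinger tree, all names hypothetical). It shows why growth is bounded
by the number of distinct counters touched rather than by the number of
updates, and that bound is the open-ended part:

	struct acct_entry {
		unsigned long long	pos;	/* which counter */
		long long		delta;
	};

	struct acct_buf {
		struct acct_entry	e[64];
		unsigned		nr;
	};

	static void acct_buf_add(struct acct_buf *b,
				 unsigned long long pos, long long delta)
	{
		for (unsigned i = 0; i < b->nr; i++)
			if (b->e[i].pos == pos) {
				b->e[i].delta += delta;	/* accumulate in place */
				return;
			}
		if (b->nr < 64)	/* grows only for distinct counters */
			b->e[b->nr++] = (struct acct_entry){ .pos = pos, .delta = delta };
	}

So worst case during replay would be one buffered entry per distinct
counter touched by interior node updates.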

If write buffer updates now basically just journal a special entry,
would that mean we'd deadlock during recovery due to no longer being
able to insert journal entries because of a pinned write buffer? If so,
that actually seems reasonable to me in the sense that at least in
theory it doesn't break the filesystem on-disk, but obviously it would
require some kind of enhancement in order to complete the recovery
(even if what that is is currently unknown). Hm?

Brian