Re: [GIT PULL] bcachefs fixes for 6.12-rc2

From: Kent Overstreet
Date: Sat Oct 05 2024 - 23:43:15 EST


On Sat, Oct 05, 2024 at 08:06:31PM GMT, Carl E. Thompson wrote:
> Yeah, of course there were the disk accounting issues and before that
> was the kernel upgrade-downgrade bug going from 6.8 back to 6.7.
> Currently over on Reddit at least one user is mentioning read errors and
> / or performance regressions on the current RC version that I'd rather
> avoid.

So, disk accounting rewrite: that code was basically complete, just
baking, for a full six months before merging - so, not exactly rushed,
and it saw user testing before merging. Given the size, and how invasive
it was, some regressions were inevitable and they were pretty small and
localized.

The upgrade/downgrade bug was really nasty, yeah.

> There were a number of other issues that cropped up in some earlier
> versions but not others such as deadlocks when using compression
> (particularly zstd), weirdness when using compression with 4k blocks
> and suspend / resume failures when using bcachefs.

I don't believe any of those were bcachefs regressions, although some
are bcachefs bugs - for suspend/resume, for example, there's still an
open bug.

I've seen multiple compression bugs that were mostly not bcachefs bugs
(i.e. there was a zstd bug that affected bcachefs that took forever to
fix, and there's a recently reported LZ4HC bug that may or may not be
bcachefs).

> None of those things were a big deal to me as I mostly only use
> bcachefs on root filesystems which are of course easy to recreate. But
> I do currently use bcachefs for all the filesystems on my main laptop
> so issues there can be more of a pain.

Are you talking about issues you've hit, or issues that you've seen
reported? Because the main subject of discussion is regressions.

>
> As an example of potential issues I'd like to avoid I often upgrade my
> laptop and swap the old SSD in and am currently considering pulling
> the trigger on a Ryzen AI laptop such as the ProArt P16. However, this
> new processor has some cutting edge features only fully supported in
> 6.12 so I'd prefer to use that kernel if I can. But... because
> according to Reddit there are apparently issues with bcachefs in the
> 6.12RC kernels that means I am hesitant to buy the laptop and use the
> RC kernel in the carefree manner I normally would. Yeah, first world
> problems!

The main 6.12-rc1 issue was actually caused by Christian's change to
inode state wakeups - it was a VFS change where bcachefs wasn't updated.

That should've been caught by automated testing on fs-next - so that
one's on me; fs-next is still fairly new and I still need to get that
going.

> Speaking of Reddit, I don't know if you saw it but a user there quotes
> you as saying users who use release candidates should expect them to
> be "dangerous as crap." I could not find a post where you said that in
> the thread that user pointed to but if you **did** say something like
> that then I guess I have a different concept of what "release
> candidate" means.

I don't recall saying that, but I did say something about Canonical
shipping rc kernels to the general population - that's a bit crazy.
Rc kernels should generally be run by users who know what they're
getting into and have some ability to help test and debug.

> So for me it would be a lot easier if bcachefs versions were decoupled
> from kernel versions.

Well, this sounds more like generalized concern than anything concrete I
can act on, to be honest - but if you've got regressions that you've
been hit by, please tell me about those.

The feedback I've generally been getting has been that each release has
been getting steadily better, and more stable and usable - and lately
pretty much all I've been doing has been fixing user reported bugs, so
those I naturally want to get out quickly if the bugs are serious enough
and I'm confident that they'll be low risk - and there has been a lot of
that.

The shrinker fixes for fsck OOMing that didn't land in 6.11 were
particularly painful for a lot of users.

The key cache/rcu pending work that didn't land in 6.11, that was a
major usability issue for several users that I talked to.

The past couple weeks I've been working on filesystem repair and
snapshots issues for several users that were inadvertently torture
testing snapshots - the fixes are turning out to be fairly involved, but
I'm also weighing there "how likely are other users to be affected by
this, and do we want to wait another 3 months", and I've got multiple
reports of affected users.