Re: RCU stalls and GPFs in ceph/netfs

From: Max Kellermann
Date: Sun Jul 28 2024 - 09:17:46 EST


On Sun, Jul 28, 2024 at 1:45 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> That is really weird. AFAICT, 2e9d7e4b984a61 is just removing some
> wrapper functions and changing the names of some others. There should
> be no functional changes there.

Exactly what I thought; I could not imagine how this commit could
cause such a bug. The only difference I could find is that
netfs_rreq_assess() now always calls netfs_rreq_completed() directly,
but not netfs_rreq_write_to_cache(). I don't know what the
implications of that are, but this changed code path is a candidate
for behaving differently. Maybe it's an old bug that only got revealed
by this change.

Anyway, I spent hours testing this commit and the preceding one, and
the picture was consistent: this commit reproduces the RCU stall
within minutes (though only on 50% or so of all boots), and the
preceding commit never did. There is still a tiny chance that I just
wasn't trying hard enough. I'm out of ideas, and all I can do now is
start digging really deeply into this code, but I thought it would be
more productive to reach out to the people who wrote it.
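
For a rough sense of how small that "tiny chance" is, here is a
minimal sketch. It assumes each boot hits the stall independently
with a probability of about 0.5 (the rate seen on the bad commit),
which may not hold exactly. The function name and the boot counts are
hypothetical; the actual number of test boots is not stated in this
mail.

```python
def prob_all_clean(per_boot_rate: float, boots: int) -> float:
    """Chance that `boots` independent boots all stay clean even though
    each one would hit the stall with probability `per_boot_rate`."""
    if not 0.0 <= per_boot_rate <= 1.0:
        raise ValueError("per_boot_rate must be between 0 and 1")
    if boots < 0:
        raise ValueError("boots must be non-negative")
    return (1.0 - per_boot_rate) ** boots


if __name__ == "__main__":
    # Hypothetical boot counts, for illustration only.
    for n in (5, 10, 20):
        print(f"{n} clean boots by luck: {prob_all_clean(0.5, n):.7f}")
```

Under those assumptions, every additional clean boot on the preceding
commit halves the chance that it is a false negative.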

Max