Re: [EXTERNAL] [PATCH v3 04/11] ceph: add diagnostic timeout loop to wait_caps_flush()
From: Alex Markuze
Date: Wed May 06 2026 - 07:38:37 EST
Hi Slava,
Integer type for loop counter i
The counter i tracks the number of 60-second timeout iterations. To
overflow an int it would need roughly 4,000 years of continuous
waiting. In practice, after CEPH_CAP_FLUSH_MAX_DUMP_COUNT (5)
iterations the diagnostics are suppressed and the counter just ticks
silently; nothing indexes on its value after that point.
I could change it to unsigned int for cleanliness, but it wouldn't
change behavior. Would you prefer that, or do you have a different
concern about the counter?
Thanks,
Alex
On Thu, Apr 30, 2026 at 12:41 AM Viacheslav Dubeyko <vdubeyko@xxxxxxxxxx> wrote:
>
> On Wed, 2026-04-29 at 12:51 +0000, Alex Markuze wrote:
> > Convert wait_caps_flush() from a silent indefinite wait into a diagnostic
> > wait loop that periodically dumps pending cap flush state.
> >
> > The underlying wait semantics remain intact: callers still wait until the
> > requested cap flushes complete. The difference is that long stalls now
> > produce actionable diagnostics instead of looking like a silent hang.
> >
> > CEPH_CAP_FLUSH_MAX_DUMP_COUNT bounds the diagnostics in two ways:
> > it limits the number of entries emitted per diagnostic dump, and it
> > limits the number of timed diagnostic dumps before the wait continues
> > silently. When more entries exist than the per-dump limit, a truncation
> > count is reported. When the dump iteration limit is reached, a final
> > suppression message is emitted so the transition to silence is explicit.
> >
> > The diagnostic dump collects flush entry data under cap_dirty_lock into
> > a bounded on-stack array, then prints after releasing the lock. This
> > avoids holding the spinlock across printk calls.
> >
> > A null cf->ci on the global flush list indicates a bug since all
> > cap_flush entries are initialized with a valid ci before being added.
> > Signal this with WARN_ON_ONCE while still printing enough context for
> > debugging.
> >
> > READ_ONCE is used for the i_last_cap_flush_ack field, which is read
> > outside the inode lock domain. Flush tids are monotonically increasing
> > and acks are processed in order under i_ceph_lock, so the latest ack
> > tid is always the most recently written value.
> >
> > Add a ci pointer to struct ceph_cap_flush so that the diagnostic
> > dump can identify which inode each pending flush belongs to. The
> > new i_last_cap_flush_ack field tracks the latest acknowledged flush
> > tid per inode for diagnostic correlation.
> >
> > This improves reset-drain observability and is also useful for
> > existing sync and writeback troubleshooting paths.
> >
> > Signed-off-by: Alex Markuze <amarkuze@xxxxxxxxxx>
> > ---
> > fs/ceph/caps.c | 10 +++++
> > fs/ceph/inode.c | 1 +
> > fs/ceph/mds_client.c | 97 ++++++++++++++++++++++++++++++++++++++++++--
> > fs/ceph/super.h | 6 +++
> > 4 files changed, 110 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > index cb9e78b713d9..4b37d9ffdf7f 100644
> > --- a/fs/ceph/caps.c
> > +++ b/fs/ceph/caps.c
> > @@ -1648,6 +1648,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
> >
> > spin_lock(&mdsc->cap_dirty_lock);
> > capsnap->cap_flush.tid = ++mdsc->last_cap_flush_tid;
> > + capsnap->cap_flush.ci = ci;
> > list_add_tail(&capsnap->cap_flush.g_list,
> > &mdsc->cap_flush_list);
> > if (oldest_flush_tid == 0)
> > @@ -1846,6 +1847,7 @@ struct ceph_cap_flush *ceph_alloc_cap_flush(void)
> > return NULL;
> >
> > cf->is_capsnap = false;
> > + cf->ci = NULL;
> > return cf;
> > }
> >
> > @@ -1931,6 +1933,7 @@ static u64 __mark_caps_flushing(struct inode *inode,
> > doutc(cl, "%p %llx.%llx now !dirty\n", inode, ceph_vinop(inode));
> >
> > swap(cf, ci->i_prealloc_cap_flush);
> > + cf->ci = ci;
> > cf->caps = flushing;
> > cf->wake = wake;
> >
> > @@ -3826,6 +3829,13 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
> > bool wake_ci = false;
> > bool wake_mdsc = false;
> >
> > + /*
> > + * Flush tids are monotonically increasing and acks arrive in
> > + * order under i_ceph_lock, so this is always the latest tid.
> > + * Diagnostic readers use READ_ONCE() without holding the lock.
> > + */
> > + WRITE_ONCE(ci->i_last_cap_flush_ack, flush_tid);
> > +
> > list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) {
> > /* Is this the one that was flushed? */
> > if (cf->tid == flush_tid)
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index f75d66760d54..de465c7e96e8 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -670,6 +670,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
> > INIT_LIST_HEAD(&ci->i_cap_snaps);
> > ci->i_head_snapc = NULL;
> > ci->i_snap_caps = 0;
> > + ci->i_last_cap_flush_ack = 0;
> >
> > ci->i_last_rd = ci->i_last_wr = jiffies - 3600 * HZ;
> > for (i = 0; i < CEPH_FILE_MODE_BITS; i++)
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index b62abae72c4c..d83003acfb06 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -27,6 +27,8 @@
> > #include <trace/events/ceph.h>
> >
> > #define RECONNECT_MAX_SIZE (INT_MAX - PAGE_SIZE)
> > +#define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60
> > +#define CEPH_CAP_FLUSH_MAX_DUMP_COUNT 5
>
> I see the point to collect all timeout declarations in one place. What about to
> place this declarations here [1]?
>
> The CEPH_CAP_FLUSH_MAX_DUMP_COUNT controls two unrelated limits: the number of
> entries printed per dump, and the number of timed diagnostic dumps before
> suppression. I think we need to have two separate constants with distinct names.
>
> >
> > /*
> > * A cluster of MDS (metadata server) daemons is responsible for
> > @@ -2286,19 +2288,106 @@ static int check_caps_flush(struct ceph_mds_client *mdsc,
> > }
> >
> > /*
> > - * flush all dirty inode data to disk.
> > + * Dump pending cap flushes for diagnostic purposes.
> > *
> > - * returns true if we've flushed through want_flush_tid
> > + * cf->ci is safe to dereference here: cap_flush entries hold a
> > + * reference on the inode (via the cap), and entries are removed from
> > + * cap_flush_list under cap_dirty_lock before the cap (and thus the
> > + * inode reference) is released. Holding cap_dirty_lock therefore
> > + * guarantees the inode remains valid for the lifetime of the scan.
> > + */
>
> This commenting looks very strange and confusing. I think we need to have
> comment for the function and comment for the structure. Especially, it's really
> important to have comment for all fields of the structure.
>
> > +struct flush_dump_entry {
> > + u64 ino;
> > + u64 snap;
> > + int caps;
> > + u64 tid;
> > + u64 last_ack;
> > + bool wake;
> > + bool is_capsnap;
> > + bool ci_null;
> > +};
> > +
> > +static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
> > +{
> > + struct ceph_client *cl = mdsc->fsc->client;
> > + struct flush_dump_entry entries[CEPH_CAP_FLUSH_MAX_DUMP_COUNT];
> > + struct ceph_cap_flush *cf;
> > + int n = 0, remaining = 0;
> > +
> > + spin_lock(&mdsc->cap_dirty_lock);
> > + list_for_each_entry(cf, &mdsc->cap_flush_list, g_list) {
> > + if (cf->tid > want_tid)
> > + break;
> > + if (n < CEPH_CAP_FLUSH_MAX_DUMP_COUNT) {
> > + struct flush_dump_entry *e = &entries[n++];
> > +
> > + e->ci_null = WARN_ON_ONCE(!cf->ci);
> > + if (!e->ci_null) {
> > + e->ino = ceph_ino(&cf->ci->netfs.inode);
> > + e->snap = ceph_snap(&cf->ci->netfs.inode);
> > + e->last_ack = READ_ONCE(cf->ci->i_last_cap_flush_ack);
> > + }
> > + e->caps = cf->caps;
> > + e->tid = cf->tid;
> > + e->wake = cf->wake;
> > + e->is_capsnap = cf->is_capsnap;
> > + } else {
> > + remaining++;
> > + }
> > + }
> > + spin_unlock(&mdsc->cap_dirty_lock);
> > +
> > + pr_info_client(cl, "still waiting for cap flushes through %llu:\n",
> > + want_tid);
> > + for (int i = 0; i < n; i++) {
> > + struct flush_dump_entry *e = &entries[i];
> > +
> > + if (e->ci_null)
> > + pr_info_client(cl,
> > + " (null ci) %s tid=%llu wake=%d%s\n",
> > + ceph_cap_string(e->caps), e->tid,
> > + e->wake,
> > + e->is_capsnap ? " is_capsnap" : "");
> > + else
> > + pr_info_client(cl,
> > + " %llx.%llx %s tid=%llu last_ack=%llu wake=%d%s\n",
> > + e->ino, e->snap,
> > + ceph_cap_string(e->caps), e->tid,
> > + e->last_ack, e->wake,
> > + e->is_capsnap ? " is_capsnap" : "");
> > + }
> > + if (remaining)
> > + pr_info_client(cl, " ... and %d more pending flushes\n",
> > + remaining);
> > +}
> > +
> > +/*
> > + * Wait for all cap flushes through @want_flush_tid to complete.
> > + * Periodically dumps pending cap flush state for diagnostics.
> > */
> > static void wait_caps_flush(struct ceph_mds_client *mdsc,
> > u64 want_flush_tid)
> > {
> > struct ceph_client *cl = mdsc->fsc->client;
> > + int i = 0;
>
> Is integer type could be enough here? Could we overflow the i value?
>
> Thanks,
> Slava.
>
> > + long ret;
> >
> > doutc(cl, "want %llu\n", want_flush_tid);
> >
> > - wait_event(mdsc->cap_flushing_wq,
> > - check_caps_flush(mdsc, want_flush_tid));
> > + do {
> > + /* 60 * HZ fits in a long on all supported architectures. */
> > + ret = wait_event_timeout(mdsc->cap_flushing_wq,
> > + check_caps_flush(mdsc, want_flush_tid),
> > + CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC * HZ);
> > + if (ret == 0) {
> > + if (i < CEPH_CAP_FLUSH_MAX_DUMP_COUNT)
> > + dump_cap_flushes(mdsc, want_flush_tid);
> > + else if (i == CEPH_CAP_FLUSH_MAX_DUMP_COUNT)
> > + pr_info_client(cl,
> > + "still waiting for cap flushes; suppressing further dumps\n");
> > + i++;
> > + }
> > + } while (ret == 0);
> >
> > doutc(cl, "ok, flushed thru %llu\n", want_flush_tid);
> > }
> > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > index 30911ccf961e..9aca42c89ea0 100644
> > --- a/fs/ceph/super.h
> > +++ b/fs/ceph/super.h
> > @@ -238,6 +238,7 @@ struct ceph_cap_flush {
> > bool is_capsnap; /* true means capsnap */
> > struct list_head g_list; // global
> > struct list_head i_list; // per inode
> > + struct ceph_inode_info *ci;
> > };
> >
> > /*
> > @@ -443,6 +444,11 @@ struct ceph_inode_info {
> > struct ceph_snap_context *i_head_snapc; /* set if wr_buffer_head > 0 or
> > dirty|flushing caps */
> > unsigned i_snap_caps; /* cap bits for snapped files */
> > + /*
> > + * Written under i_ceph_lock, read via READ_ONCE()
> > + * from diagnostic paths.
> > + */
> > + u64 i_last_cap_flush_ack;
> >
> > unsigned long i_last_rd;
> > unsigned long i_last_wr;
>
> [1]
> https://elixir.bootlin.com/linux/v7.0.1/source/include/linux/ceph/libceph.h#L72
>