Re: [PATCH v6 3/6] nfsd: implement server-stats-get netlink handler

From: Chuck Lever

Date: Fri Jun 19 2026 - 16:42:56 EST

On Fri, Jun 19, 2026, at 11:26 AM, Jeff Layton wrote:
> Implement nfsd_nl_server_stats_get_dumpit() which exposes the
> NFS server statistics currently available via /proc/net/rpc/nfsd
> through the nfsd generic netlink family.

A couple of questions on the dump design and one on the commit message,
then some smaller items.

The commit message says:

> - First message: all scalar stats (reply cache, filehandle,
> IO, network, RPC) plus per-version procedure counts
> (proc2/3/4-ops) using per-netns vs_count arrays.
>
> - Subsequent messages: NFSv4 per-operation counts
> (proc4ops-ops), one entry per message, using cb->args[0]
> to track the current operation index across dump calls.

The code does not stream across subsequent messages. The proc4ops-ops
nests are emitted in the first message, inside the same "if (start == 0)"
block as the scalars, and cb->args[0] only ever holds 0 or -1, never an
operation index. The function's own kerneldoc agrees: "The entire
server-stats object is emitted in a single netlink message on the first
invocation."

Should the commit message be reworded to match the code, or should the code
move to the streaming design the message describes? That second option ties
into the next question.

> diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
> index 601301e34fc7..f0514d8149cd 100644
> --- a/fs/nfsd/nfsctl.c
> +++ b/fs/nfsd/nfsctl.c
> @@ -2329,6 +2329,185 @@ int nfsd_nl_cache_flush_doit(struct sk_buff *skb, struct genl_info *info)
> return 0;
> }
>
> +static int nfsd_nl_fill_proc_ops(struct sk_buff *skb, int attr,
> + unsigned long __percpu *counts,
> + unsigned int nproc)
> +{
> + struct nlattr *nest;
> + unsigned int j;
> + int k;
> +
> + for (j = 0; j < nproc; j++) {
> + unsigned long count = 0;
> +
> + for_each_possible_cpu(k)
> + count += per_cpu(counts[j], k);
> +
> + nest = nla_nest_start(skb, attr);
> + if (!nest)
> + return -EMSGSIZE;
> + if (nla_put_u32(skb, NFSD_A_SERVER_PROC_ENTRY_OP, j) ||
> + nla_put_u64_64bit(skb, NFSD_A_SERVER_PROC_ENTRY_COUNT,
> + count, NFSD_A_SERVER_PROC_ENTRY_PAD)) {
> + nla_nest_cancel(skb, nest);
> + return -EMSGSIZE;
> + }
> + nla_nest_end(skb, nest);
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * nfsd_nl_server_stats_get_dumpit - dump NFS server statistics
> + * @skb: reply buffer
> + * @cb: netlink metadata and command arguments
> + *
> + * The entire server-stats object is emitted in a single netlink message on
> + * the first invocation. cb->args[0] is set to -1 afterwards so that the next
> + * invocation terminates the dump.
> + *
> + * Returns the size of the reply or a negative errno.
> + */
> +int nfsd_nl_server_stats_get_dumpit(struct sk_buff *skb,
> + struct netlink_callback *cb)
> +{
> + struct net *net = sock_net(skb->sk);
> + struct nfsd_net *nn = net_generic(net, nfsd_net_id);
> + struct svc_stat *statp = &nn->nfsd_svcstats;
> + struct svc_program *prog = statp->program;
> + int start = cb->args[0];
> + void *hdr;
> +
> + /*
> + * cb->args[0] == 0: first call, emit the full server-stats object
> + * cb->args[0] < 0: dump already complete
> + */
> + if (start < 0)
> + return 0;
> +
> + if (start == 0) {
> + hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid,
> + cb->nlh->nlmsg_seq, &nfsd_nl_family,
> + NLM_F_MULTI, NFSD_CMD_SERVER_STATS_GET);
> + if (!hdr)
> + return -ENOBUFS;
> +

This handler assembles the entire server-stats object -- all scalars, the
proc2/3/4-ops nests, and one proc4ops-ops nest per NFSv4 operation -- in one
message under "if (start == 0)", then sets cb->args[0] = -1 so the next call
ends the dump.

Does that hold up once the assembled message no longer fits the dump buffer?
A dump skb is allocated at max(cb->min_dump_alloc, NLMSG_GOODSIZE), and this
handler sets neither cb->min_dump_alloc nor a .start callback, so the floor
is NLMSG_GOODSIZE, which is SKB_WITH_OVERHEAD(PAGE_SIZE), roughly 3.7k on a
4k-page host. A larger skb is allocated only when userspace has previously
grown its receive buffer.

With LAST_NFS4_OP at 75, the proc4ops-ops loop alone emits 76 nests of about
28 bytes each (~2.1k); the proc2 and proc3 nests plus the scalar block bring
the total to roughly 3.5k -- already within a few hundred bytes of the floor,
and growing by one nest for every NFSv4 operation added later.

[ ... scalar puts for reply cache, fh, io, net, rpc snipped ... ]

> + /* Per-version procedure counts */
> + if (statp->vs_count) {
> + static const int proc_attrs[] = {
> + [2] = NFSD_A_SERVER_STATS_PROC2_OPS,
> + [3] = NFSD_A_SERVER_STATS_PROC3_OPS,
> + [4] = NFSD_A_SERVER_STATS_PROC4_OPS,
> + };
> + unsigned int i;
> +
> + for (i = 0; i < prog->pg_nvers &&
> + i < ARRAY_SIZE(proc_attrs); i++) {
> + if (!prog->pg_vers[i] ||
> + !statp->vs_count[i])
> + continue;
> + if (!proc_attrs[i])
> + continue;
> + if (nfsd_nl_fill_proc_ops(skb,
> + proc_attrs[i],
> + statp->vs_count[i],
> + prog->pg_vers[i]->vs_nproc))
> + goto err_cancel;
> + }
> + }
> +
> +#ifdef CONFIG_NFSD_V4
> + /* NFSv4 individual operation counts */
> + for (int i = 0; i <= LAST_NFS4_OP; i++) {
> + struct nlattr *nest;
> + u64 cnt;
> +
> + cnt = percpu_counter_sum_positive(
> + &nn->counter[NFSD_STATS_NFS4_OP(i)]);
> +
> + nest = nla_nest_start(skb,
> + NFSD_A_SERVER_STATS_PROC4OPS_OPS);
> + if (!nest)
> + goto err_cancel;
> + if (nla_put_u32(skb, NFSD_A_SERVER_PROC_ENTRY_OP, i) ||
> + nla_put_u64_64bit(skb, NFSD_A_SERVER_PROC_ENTRY_COUNT,
> + cnt, NFSD_A_SERVER_PROC_ENTRY_PAD)) {
> + nla_nest_cancel(skb, nest);
> + goto err_cancel;
> + }
> + nla_nest_end(skb, nest);
> + }
> +#endif

This loop open-codes the same nest that nfsd_nl_fill_proc_ops() builds just
above -- nla_nest_start(), nla_put_u32(PROC_ENTRY_OP),
nla_put_u64_64bit(PROC_ENTRY_COUNT), nla_nest_end() -- into the same
NFSD_A_SERVER_STATS_PROC4OPS_OPS attribute. Could the helper be generalized
to take the per-op counter source so this is not a second copy of the same
code?

The per-version block above skips empty versions:

if (!prog->pg_vers[i] || !statp->vs_count[i])
continue;

but this loop emits an entry for every op 0..LAST_NFS4_OP, zero-count ops
included. Is that difference intentional? Skipping zero counts here would
also trim the worst-case message size above.

There is also a counter that this dump does not emit. /proc/net/rpc/nfsd
prints a wdeleg_getattr line after proc4ops:

seq_printf(seq, "\nwdeleg_getattr %lld",
percpu_counter_sum_positive(&nn->counter[NFSD_STATS_WDELEG_GETATTR]));

incremented by nfsd_stats_wdeleg_getattr_inc(). Since the goal is to expose
the statistics currently available via /proc/net/rpc/nfsd, should
wdeleg_getattr get an attribute here too, so nfsstat over netlink does not
drop it relative to the procfs path?

> +
> + genlmsg_end(skb, hdr);
> + }
> +
> + cb->args[0] = -1;
> + return skb->len;
> +
> +err_cancel:
> + genlmsg_cancel(skb, hdr);
> + return -EMSGSIZE;
> +}

Back to the buffer question: when an nla_put eventually fails, control
reaches err_cancel, and genlmsg_cancel() trims the partial message back out,
so skb->len is 0 on return. In netlink_dump():

if (nlk->dump_done_errno == -EMSGSIZE && skb->len)
nlk->dump_done_errno = skb->len;

the skb->len test fails, so -EMSGSIZE is delivered to userspace as a
terminal error rather than being taken as "buffer full, call again". With
everything emitted under "if (start == 0)" and no per-attribute cursor in
cb->args, a retry rebuilds the same oversized message, so the dump cannot
ever complete.

nfsd_nl_rpc_status_get_dumpit() already follows the streaming pattern the
commit message describes: one entry per message, saving the cursor in
cb->args and returning skb->len on EMSGSIZE. Would building this dump the
same way be more robust than the single-message form?

--
Chuck Lever