Re: [PATCH] audit: add backlog high water mark metric

From: Steve Grubb

Date: Tue Apr 14 2026 - 23:46:51 EST

Hello Paul,

On Friday, April 10, 2026 5:34:08 PM Eastern Daylight Time Paul Moore wrote:
> On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina <rrobaina@xxxxxxxxxx> wrote:
> > Currently, determining the optimal `audit_backlog_limit` relies on
> > instantaneous polling of the queue size. This misses transient
> > micro-bursts, making it difficult for system administrators to know
> > if their queue is adequately sized or if they are at risk of
> > dropping events.
> >
> > This patch introduces `backlog_max_depth`, a high-water mark metric
> > that tracks the maximum number of buffers in the audit queue since
> > the system was booted or the metric was last reset. To minimize
> > performance overhead in the fast-path, the metric is updated using
> > a lockless cmpxchg loop in `__audit_log_end()`.
> >
> > Userspace can read-and-clear this metric by sending an `AUDIT_SET`
> > message with the `AUDIT_STATUS_BACKLOG_MAX_DEPTH` mask. To support
> > periodic telemetry polling (e.g., statsd, Prometheus), the reset
> > operation atomically returns the snapshot of the high-water mark
> > right before zeroing it, ensuring no peaks are lost between polls.
> >
> > Link: https://github.com/linux-audit/audit-kernel/issues/63
> > Suggested-by: Steve Grubb <sgrubb@xxxxxxxxxx>
> > Signed-off-by: Ricardo Robaina <rrobaina@xxxxxxxxxx>
> > ---
> >
> > include/linux/audit.h | 3 ++-
> > include/uapi/linux/audit.h | 2 ++
> > kernel/audit.c | 32 ++++++++++++++++++++++++++++++++
> > 3 files changed, 36 insertions(+), 1 deletion(-)
>
> I sat on this for a bit because I wanted to think on it for a while.
> While I agree audit could benefit from better statistics around
> queue/backlog status, I'm not sure a single "max" value alone is worth
> a bit in the audit_status bitmask. My concern is that the max queue
> length only provides a single snapshot of what the queue looked like,
> it doesn't give any indication of the average queue length over a span
> of time. Some audit users are willing to live with occasional drops,
> and the max size doesn't help them arrive at a good balance.
>
> As for the users who can't tolerate any audit record drops? They
> shouldn't be running with a backlog limit anyway so the maximum queue
> value will be of limited use.

The existing audit_lost counter tells administrators they have already
failed; the proposed backlog_max_depth tells them they are at risk of
failing. These are different signals serving different operational needs. The
dominant real-world deployment is the compliance-driven system that must use
a finite backlog limit for memory safety but cannot tolerate dropped events.
Administrators of those systems have no existing mechanism to verify that
the limit is correctly sized between polling intervals. Instantaneous backlog
polling is blind to sub-second bursts. Only a high-water mark, atomically
reset at each poll, closes this gap. The average queue length would not
answer the question 'did I ever come close to the limit?'; only the maximum
can.
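
To make the polling argument concrete, here is a minimal userspace sketch
using C11 atomics. It is not the code from the patch and it does not use the
kernel's atomic API; queue_depth, hwm_update(), hwm_read_and_reset() and
burst() are hypothetical names chosen for illustration. It only shows the
shape of the idea: a lockless compare-and-swap loop raises the mark on
enqueue, and the poller reads and zeroes it in a single atomic exchange.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins; these are not the names used in kernel/audit.c. */
static atomic_uint queue_depth;        /* buffers currently queued */
static atomic_uint backlog_max_depth;  /* high-water mark since last reset */

/* Raise the high-water mark to `depth` if it is larger (lockless CAS loop). */
static void hwm_update(unsigned int depth)
{
    unsigned int seen = atomic_load(&backlog_max_depth);

    /* A failed compare-exchange reloads `seen`, so the bound is re-checked. */
    while (depth > seen &&
           !atomic_compare_exchange_weak(&backlog_max_depth, &seen, depth))
        ;
}

/* Queue one buffer and record the depth it produced. */
static void enqueue(void)
{
    unsigned int depth = atomic_fetch_add(&queue_depth, 1) + 1;

    hwm_update(depth);
}

/* Remove one buffer; the high-water mark is deliberately left untouched. */
static void dequeue(void)
{
    atomic_fetch_sub(&queue_depth, 1);
}

/* Read-and-clear: return the peak and zero it in one atomic step. */
static unsigned int hwm_read_and_reset(void)
{
    return atomic_exchange(&backlog_max_depth, 0);
}

/* Simulate a burst of n events that drains completely before the next poll. */
static void burst(unsigned int n)
{
    for (unsigned int i = 0; i < n; i++)
        enqueue();
    for (unsigned int i = 0; i < n; i++)
        dequeue();
}
```

After burst(50), a poller that samples queue_depth sees an empty queue, while
hwm_read_and_reset() still reports the peak of 50 and starts a fresh
interval. Because the read and the reset are one exchange, a peak recorded
between them cannot be lost.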

On the bitmask concern: the last addition was
AUDIT_STATUS_BACKLOG_WAIT_TIME_ACTUAL, six years ago, so the mask is not
filling up quickly.

If you don't think this closes the gap on what people need, the patch could
be amended to include backlog_lost_since_reset (drops since last poll)
alongside the max so that you get two metrics for the price of one bit. But
this is absolutely needed because people are flying blind without it.

-Steve