Re: [PATCH] audit: add backlog high water mark metric

From: Paul Moore

Date: Tue May 12 2026 - 12:20:47 EST


On Fri, Apr 17, 2026 at 9:02 AM Ricardo Robaina <rrobaina@xxxxxxxxxx> wrote:
> On Thu, Apr 16, 2026 at 5:58 PM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> > On Thu, Apr 16, 2026 at 4:51 PM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> > > On Thu, Apr 16, 2026 at 4:33 PM Steve Grubb <sgrubb@xxxxxxxxxx> wrote:
> > > > On Wednesday, April 15, 2026 11:21:52 AM Eastern Daylight Time Paul Moore wrote:
> > > > > On Wed, Apr 15, 2026 at 11:19 AM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> > > > > > On Tue, Apr 14, 2026 at 11:45 PM Steve Grubb <sgrubb@xxxxxxxxxx> wrote:
> > > > > > > On Friday, April 10, 2026 5:34:08 PM Eastern Daylight Time Paul Moore wrote:
> > > > > > > > On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina <rrobaina@xxxxxxxxxx> wrote:
> > > > > > ...
> > > > > >
> > > > > > > ... compliance-driven systems that must use a finite backlog limit for
> > > > > > > memory safety but cannot tolerate dropped events ...
> > > > > >
> > > > > > You must pick one of those two requirements, or at the very least
> > > > > > prioritize them; it is simply impossible to both limit the backlog
> > > > > > queue and require zero dropped events.
> > > > >
> > > > > To be perfectly honest, it's also impossible to require zero dropped
> > > > > events. Even in the most extreme configurations where the admin
> > > > > decides to panic the system, that only happens once the system reaches
> > > > > the point where it is dropping events. We try *really* hard to not
> > > > > drop events, but it is always going to be a possibility.
> > > >
> > > > You're helping make the point. Those administrators have decided reliable
> > > > auditing is more important than system availability.
> > >
> > > Users prioritizing reliable auditing over system availability should
> > > not run with a backlog limit. It's that simple.
> >
> > To clarify this further, even on systems without a backlog limit and a
> > panic-on-loss configuration, there is still a possibility that the
> > system could lose an event when it hits the edge before it panics. A
> > maximum backlog stat won't help here. Even if you had a way to
> > capture the backlog size before the system took itself out, the size
> > is flirting with the maximum resource limits of the system, it would
> > be silly to use that as a configured backlog limit, you would still
> > want to leave the limit at 0/disabled.
> >
> > > Regardless, I'm still not convinced this maximum backlog stat alone
> > > will solve any meaningful problems. If your audit log is predictable
> > > enough that this metric has value, it should be possible to either
> > > capture the backlog size during periods of high audit load or simply
> > > run the system through that load and verify it doesn't crash and go to
> > > hell. If your audit log isn't predictable, capturing a maximum
> > > backlog size doesn't really mean anything since it is still a snapshot
> > > of one instance of the system and there is always the possibility of
> > > the system exceeding it.
> >
> > --
> > paul-moore.com
> >
>
> Hi Paul,
>
> Thanks for reviewing the patch and giving your perspective on it.
>
> I understand your point that if a system truly prioritizes auditing
> over everything else, it shouldn't run with a limit. However, in
> practice, there is a middle ground where compliance frameworks or
> internal infrastructure policies require a finite backlog limit to
> ensure memory safety, while still demanding reliable auditing.

It is important that those users understand they are believing a lie
if they think one can demand reliable auditing with a finite backlog
limit.
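
As a rough illustration of that point, here is a toy user-space model
of a bounded queue. It is only a sketch and is not the kernel's audit
code: the class, its fields (high_water, lost), and the burst sizes are
all hypothetical, chosen just to show that a limit derived from one
observed high water mark still drops events once a larger burst
arrives.

```python
from collections import deque


class BoundedBacklog:
    """Toy model of a bounded event queue; NOT the kernel's audit code."""

    def __init__(self, limit):
        # limit == 0 means "no limit", mirroring the 0/disabled
        # convention mentioned earlier in the thread.
        self.limit = limit
        self.queue = deque()
        self.high_water = 0  # hypothetical "max depth" metric
        self.lost = 0        # hypothetical "lost since reset" counter

    def enqueue(self, event):
        """Queue an event, or count it as lost if the limit is reached."""
        if self.limit and len(self.queue) >= self.limit:
            self.lost += 1
            return False
        self.queue.append(event)
        self.high_water = max(self.high_water, len(self.queue))
        return True

    def drain(self, count):
        """Simulate the consumer removing up to `count` events."""
        for _ in range(min(count, len(self.queue))):
            self.queue.popleft()


def run_burst(backlog, produced, consumed):
    """Enqueue `produced` events, then let the consumer drain `consumed`."""
    for i in range(produced):
        backlog.enqueue(i)
    backlog.drain(consumed)


# One observed "typical" load with no limit configured.
observed = BoundedBacklog(limit=0)
run_burst(observed, produced=40, consumed=40)

# Reuse that snapshot as the configured limit.
tuned = BoundedBacklog(limit=observed.high_water)
run_burst(tuned, produced=40, consumed=40)
lost_on_typical_load = tuned.lost

# A burst larger than anything seen before overflows the limit.
run_burst(tuned, produced=55, consumed=55)
lost_on_larger_burst = tuned.lost
```

In this model the tuned queue loses nothing on the load it was sized
for and starts losing events as soon as the burst exceeds the recorded
maximum, which is the snapshot problem in miniature.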

> I'd like to ask what specific metric or combination of metrics you
> would be willing to consider? You mentioned average queue length
> earlier, and Steve suggested combining the max depth with a
> backlog_lost_since_reset counter. I'm happy to work on a v2 that
> addresses your concerns while still delivering the metrics audit users
> currently lack.

My suggestion would be to put forth a proposal explaining the problems
you want to solve and what metrics you believe are important towards
solving those problems.  I agree that the current list of audit
metrics is rather sparse, but as we've seen here, I don't think we
yet have agreement on which metrics would be useful.  My hope is that
discussing the metrics first will avoid false starts like the one
we've seen in this thread.

--
paul-moore.com