Re: [PATCH] audit: add backlog high water mark metric

From: Paul Moore

Date: Thu Apr 16 2026 - 16:58:47 EST


On Thu, Apr 16, 2026 at 4:51 PM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> On Thu, Apr 16, 2026 at 4:33 PM Steve Grubb <sgrubb@xxxxxxxxxx> wrote:
> > On Wednesday, April 15, 2026 11:21:52 AM Eastern Daylight Time Paul Moore
> > wrote:
> > > On Wed, Apr 15, 2026 at 11:19 AM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> > > > On Tue, Apr 14, 2026 at 11:45 PM Steve Grubb <sgrubb@xxxxxxxxxx> wrote:
> > > > > On Friday, April 10, 2026 5:34:08 PM Eastern Daylight Time Paul Moore
> > > > > wrote:
> > > > > > On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina
> > > > > > <rrobaina@xxxxxxxxxx> wrote:
> > > > ...
> > > >
> > > > > ... compliance-driven systems that must use a finite backlog limit for
> > > > > memory safety but cannot tolerate dropped events ...
> > > >
> > > > You must pick one of those two requirements, or at the very least
> > > > prioritize them; it is simply impossible to both limit the backlog
> > > > queue and require zero dropped events.
> > >
> > > To be perfectly honest, it's also impossible to require zero dropped
> > > events. Even in the most extreme configurations where the admin
> > > decides to panic the system, that only happens once the system reaches
> > > the point where it is dropping events. We try *really* hard to not
> > > drop events, but it is always going to be a possibility.
> >
> > You're helping make the point. Those administrators have decided reliable
> > auditing is more important than system availability.
>
> Users prioritizing reliable auditing over system availability should
> not run with a backlog limit. It's that simple.

To clarify this further, even on systems with no backlog limit and a
panic-on-loss configuration, there is still a possibility that the
system could lose an event when it hits the edge, before it panics. A
maximum backlog stat won't help here. Even if you had a way to
capture the backlog size before the system took itself out, that size
is flirting with the maximum resource limits of the system. It would
be silly to use that as a configured backlog limit; you would still
want to leave the limit at 0/disabled.

> Regardless, I'm still not convinced this maximum backlog stat alone
> will solve any meaningful problems. If your audit log is predictable
> enough that this metric has value, it should be possible to either
> capture the backlog size during periods of high audit load or simply
> run the system through that load and verify it doesn't crash and go to
> hell. If your audit log isn't predictable, capturing a maximum
> backlog size doesn't really mean anything since it is still a snapshot
> of one instance of the system and there is always the possibility of
> the system exceeding it.

--
paul-moore.com