Re: [PATCH] audit: add backlog high water mark metric

From: Ricardo Robaina

Date: Fri Apr 17 2026 - 09:02:46 EST


On Thu, Apr 16, 2026 at 5:58 PM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
>
> On Thu, Apr 16, 2026 at 4:51 PM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> > On Thu, Apr 16, 2026 at 4:33 PM Steve Grubb <sgrubb@xxxxxxxxxx> wrote:
> > > On Wednesday, April 15, 2026 11:21:52 AM Eastern Daylight Time Paul Moore
> > > wrote:
> > > > On Wed, Apr 15, 2026 at 11:19 AM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> > > > > On Tue, Apr 14, 2026 at 11:45 PM Steve Grubb <sgrubb@xxxxxxxxxx> wrote:
> > > > > > On Friday, April 10, 2026 5:34:08 PM Eastern Daylight Time Paul Moore
> > > wrote:
> > > > > > > On Mon, Mar 23, 2026 at 11:07 AM Ricardo Robaina
> > > > > > > <rrobaina@xxxxxxxxxx>
> > > > > >
> > > > > > wrote:
> > > > > ...
> > > > >
> > > > > > ... compliance-driven systems that must use a finite backlog limit for
> > > > > > memory safety but cannot tolerate dropped events ...
> > > > >
> > > > > You must pick one of those two requirements, or at the very least
> > > > > prioritize them; it is simply impossible to both limit the backlog
> > > > > queue and require zero dropped events.
> > > >
> > > > To be perfectly honest, it's also impossible to require zero dropped
> > > > events. Even in the most extreme configurations where the admin
> > > > decides to panic the system, that only happens once the system reaches
> > > > the point where it is dropping events. We try *really* hard to not
> > > > drop events, but it is always going to be a possibility.
> > >
> > > You're helping make the point. Those administrators have decided reliable
> > > auditing is more important than system availability.
> >
> > Users prioritizing reliable auditing over system availability should
> > not run with a backlog limit. It's that simple.
>
> To clarify this further, even on systems without a backlog limit and a
> panic-on-loss configuration, there is still a possibility that the
> system could lose an event when it hits the edge before it panics. A
> maximum backlog stat won't help here. Even if you had a way to
> capture the backlog size before the system took itself out, the size
> is flirting with the maximum resource limits of the system, it would
> be silly to use that as a configured backlog limit, you would still
> want to leave the limit at 0/disabled.
>
> > Regardless, I'm still not convinced this maximum backlog stat alone
> > will solve any meaningful problems. If your audit log is predictable
> > enough that this metric has value, it should be possible to either
> > capture the backlog size during periods of high audit load or simply
> > run the system through that load and verify it doesn't crash and go to
> > hell. If your audit log isn't predictable, capturing a maximum
> > backlog size doesn't really mean anything since it is still a snapshot
> > of one instance of the system and there is always the possibility of
> > the system exceeding it.
>
> --
> paul-moore.com
>

Hi Paul,

Thanks for reviewing the patch and giving your perspective on it.

I understand your point that if a system truly prioritizes auditing
over everything else, it shouldn't run with a limit. However, in
practice, there is a middle ground where compliance frameworks or
internal infrastructure policies require a finite backlog limit to
ensure memory safety, while still demanding reliable auditing.

Currently, audit users are looking for a way to tune the system based
on an optimal setting for their workload that satisfies memory
constraints while practically minimizing dropped events to near-zero.
I strongly believe such users would make good use of backlog_max_depth
because it lets them know what the worst-case scenarios look like and
how big the spikes can be. This allows them to base their tuning
decisions on real data rather than guesswork, as is usually done
nowadays. Beyond that, exposing such metrics would allow users to
leverage services like tuned to dynamically adjust limits based on
workload.

That being said, I hear your concern about whether a single "max"
value alone is worth consuming a bit in the audit_status bitmask. So,
what specific metric or combination of metrics would you be willing
to consider? You mentioned average queue length
earlier, and Steve suggested combining the max depth with a
backlog_lost_since_reset counter. I'm happy to work on a v2 that
addresses your concerns while still delivering the metrics audit users
currently lack.

-Ricardo