Re: data-race in audit_log_start / audit_receive

From: Paul Moore
Date: Fri Aug 19 2022 - 08:07:03 EST


On Thu, Aug 18, 2022 at 9:59 PM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> On Thu, Aug 18, 2022 at 6:23 PM Abhishek Shah
> <abhishek.shah@xxxxxxxxxxxx> wrote:
> > Hi all,
> >
> > We found a data race involving the audit_cmd_mutex.owner variable. We think this bug is concerning because audit_ctl_owner_current is used at a location that controls the scheduling of tasks shown here. Please let us know what you think.
> >
> > Thanks!
> >
> > -----------------Report----------------------
> >
> > write to 0xffffffff881d0710 of 8 bytes by task 6541 on cpu 0:
> > audit_ctl_lock kernel/audit.c:237 [inline]
>
> ...
>
> > read to 0xffffffff881d0710 of 8 bytes by task 6542 on cpu 1:
> > audit_ctl_owner_current kernel/audit.c:258 [inline]
>
> Yes, technically there is a race condition if/when an auditd instance
> is registering itself at the exact same time as another task is
> attempting to log an audit record via audit_log_start().

I realized after I sent this and turned off my computer last night
that I typed the wrong thing - the race isn't between auditd and
audit_log_start(); it's between the code which changes the audit
subsystem state (see audit_receive() and the audit watch/tree code)
and audit_log_start().
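
For reference, the pattern in question looks roughly like this
(trimmed from kernel/audit.c; exact details vary by kernel version):
audit_ctl_lock() does a plain write to audit_cmd_mutex.owner while
holding the mutex, and audit_log_start() calls
audit_ctl_owner_current(), which reads the same field locklessly:

static void audit_ctl_lock(void)
{
        mutex_lock(&audit_cmd_mutex.lock);
        /* plain write, racing with the lockless read below */
        audit_cmd_mutex.owner = current;
}

static void audit_ctl_unlock(void)
{
        audit_cmd_mutex.owner = NULL;
        mutex_unlock(&audit_cmd_mutex.lock);
}

static bool audit_ctl_owner_current(void)
{
        /* called from audit_log_start() with no lock held */
        return audit_cmd_mutex.owner == current;
}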

> The risk is that a *very* limited number of audit records could be
> mis-handled with respect to their queue priority, and that is it; no
> records would be lost or misplaced. Correcting this would likely
> involve a more complex locking scheme[1] or a rather severe
> performance penalty due to an additional lock in the audit_log_start()
> code path. There may be some value in modifying
> audit_ctl_owner_current() to use READ_ONCE(), but it isn't clear to me
> that this would significantly improve things, or that it would come
> without a performance cost.

Another thing I thought of last night - I don't believe READ_ONCE()
adds a memory barrier, which would probably be needed here. My
original statement still stands though: I'm not sure the marginal
improvement to the audit queue would justify the performance hit.
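
To make that concrete: if the ordering did turn out to matter, the
change would presumably be a release/acquire pair on the owner field
rather than a bare READ_ONCE(). A minimal sketch, not a proposed
patch:

static void audit_ctl_lock(void)
{
        mutex_lock(&audit_cmd_mutex.lock);
        /* order the owner store against lockless readers */
        smp_store_release(&audit_cmd_mutex.owner, current);
}

static bool audit_ctl_owner_current(void)
{
        /* pairs with the release in audit_ctl_lock() */
        return smp_load_acquire(&audit_cmd_mutex.owner) == current;
}

The acquire lands in the audit_log_start() fast path for every record,
which is exactly the performance cost in question.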

> Have you noticed any serious problems on your system due to this? If
> you have a reproducer which shows actual harm on the system, could
> you please share that?
>
> [1] The obvious choice would be to move to an RCU-based scheme, but
> even that doesn't totally solve the problem as there would still be a
> window where some tasks would see an "old" value. It might actually
> end up extending the race window on large multi-core systems due to
> the time needed for all of the critical sections to complete.
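
For completeness, a minimal sketch of what that RCU direction might
look like - hypothetical code, with the owner moved to a made-up
audit_ctl_owner pointer purely for illustration; the grace period
wait is where the extra latency on large systems would come from:

/* hypothetical sketch only, not a proposed patch */
static struct task_struct __rcu *audit_ctl_owner;

static void audit_ctl_lock(void)
{
        mutex_lock(&audit_cmd_mutex.lock);
        rcu_assign_pointer(audit_ctl_owner, current);
}

static void audit_ctl_unlock(void)
{
        RCU_INIT_POINTER(audit_ctl_owner, NULL);
        mutex_unlock(&audit_cmd_mutex.lock);
        /* readers in their read-side critical sections may still
         * see the old owner until the grace period completes */
        synchronize_rcu();
}

static bool audit_ctl_owner_current(void)
{
        bool ret;

        rcu_read_lock();
        ret = (rcu_dereference(audit_ctl_owner) == current);
        rcu_read_unlock();
        return ret;
}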

--
paul-moore.com