Re: [RFC PATCH] audit: Send netlink ACK before setting connection in auditd_set

From: Paul Moore
Date: Mon Oct 16 2023 - 16:17:04 EST

Next message: Alex Williamson: "Re: [PATCH 1/2] vfio/mtty: Fix eventfd leak"
Previous message: Dorcas AnonoLitunya: "[PATCH v2 5/5] Staging: sm750fb: Rename ddk750_setModeTiming"
In reply to: Chris Riches: "Re: [RFC PATCH] audit: Send netlink ACK before setting connection in auditd_set"
Next in thread: Chris Riches: "Re: [RFC PATCH] audit: Send netlink ACK before setting connection in auditd_set"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Oct 16, 2023 at 1:12 PM Chris Riches <chris.riches@xxxxxxxxxxx> wrote:
>

Thanks for trimming the email in your reply, however, it is helpful to
preserve those "On Mon, Oct ..." headers for those emails which you
include in your reply, it helps keep things straight when reading the
email. Not a big deal, just something to keep in mind for next time.

> > I think the basic approach is good, but I think we can simply things
> > slightly by using a resettable boolean as opposed to an integer flag
> > for the ACK. I've pasted in a quick, untested patch (below) to better
> > demonstrate the idea, let me know what you think.
>
> The simplified patch doesn't look quite right. I had some trouble
> getting it to apply (seems tabs were changed into spaces, even when I
> downloaded the raw email).

I should have been more clear, that's what just a quick hack that I
cut-n-pasted into the email body, whitespace damage was a given.
Typically if I include a patch with the qualification that it is
untested, you can expect problems :) but I'll try to make the pitfalls
more explicit in the future.

I usually include these hacky patches simply as a way to help clear up
any misunderstandings that might happen when trying to explain an
approach using English descriptions. Much in the same way we say that
a picture is worth a thousand words, I believe a patch, even a
relatively crude one, is worth at least as many words as a picture :)

> While typing it out manually, I noticed that
> the condition for sending the ACK isn't correct - if NLM_F_ACK is 0 to
> begin with, then ack will be false to begin with, and so no ACK will be
> sent even if there is an error.

Good point. I'll just casually remind you that I did say "untested" ;)

I believe the following should work as intended (untested, cut-n-paste, etc.):

@@ -1538,9 +1551,10 @@ static int audit_receive_msg(struct sk_buff *skb, struct
nlmsghdr *nlh)
* Parse the provided skb and deal with any messages that may be present,
* malformed skbs are discarded.
*/
-static void audit_receive(struct sk_buff *skb)
+static void audit_receive(struct sk_buff *skb)
{
struct nlmsghdr *nlh;
+ bool ack;
/*
* len MUST be signed for nlmsg_next to be able to dec it below 0
* if the nlmsg_len was not aligned
@@ -1553,9 +1567,13 @@ static void audit_receive(struct sk_buff *skb)

audit_ctl_lock();
while (nlmsg_ok(nlh, len)) {
- err = audit_receive_msg(skb, nlh);
- /* if err or if this message says it wants a response */
- if (err || (nlh->nlmsg_flags & NLM_F_ACK))
+ ack = true;
+ err = audit_receive_msg(skb, nlh, &ack);
+
+ /* if audit_receive_msg() clears @ack then we never want to
+ * send an ack here, otherwise send an ack on error or if
+ * requested */
+ if (ack && (err || (nlh->nlmsg_flags & NLM_F_ACK)))
netlink_ack(skb, nlh, err, NULL);

> Handling this case is why I used the
> three-state system to begin with. I think we could still use a boolean
> with a condition of just (err || ack), with the caveat that we wouldn't
> easily be able to send an early ACK on an error (not that the current
> patch needs this - just thinking of reusability).

While there may be three states (never ACK, ACK on error, always ACK),
the audit_receive_msg() function only needs to report back one of two
states: never ACK or ACK if desired.

> > Regardless of any other issues, I think you brought up a good point
> > about there being socket buffer contention when the queues are
> > full(ish) and an audit daemon connects to the kernel and while I'm not
> > sure that this patch will fully resolve that issue, I do think it
> > would be good to have (I'm doubtful if it can be fully resolved
> > without some really awful hacks).
>
> That roughly lines up with what I was seeing - could you elaborate any
> further on what "awful hacks" would be needed to fully resolve this?

I'm not sure I can recall everything from when I was thinking about
this previously (that was about a week ago), but my quick thoughts
right now are that you would need a lot more information and/or
handshakes between the kernel and the daemon.

Unfortunately, both the current audit design and implementation is
seriously flawed in a number of areas. One of these areas is the fact
that data and control messages are sent using the same data flow.

> > The ENOBUFS errors are coming from the netlink layer and are likely a
> > sign of extreme load. I'm not very familiar with the audit userspace
> > code, but it might be helpful to see if you can increase the socket
> > buffer size for the audit daemon.
>
> I believe we tried increasing the buffer size in the toy repro and it
> didn't make much difference - perhaps doing it in the real audit daemon
> might help.

The reproducer is an artificial worst case since you are disconnecting
and reconnecting without a process shutdown and startup in between.
The reproducer just hammers the connection status as fast as the CPU
can generate netlink messages; I'm not certain there is a
netlink/socket buffer size that would fully resolve this for the
reproducer.

> > I'm also not surprised to hear that as you increase the number of CPUs
> > the problem goes away. With more CPUs the system is able to schedule
> > more threads simultaneously to maintain the kernel's audit queues and
> > execute the audit daemon to drain both the netlink socket buffer and
> > audit queues.
>
> My intuition told me the opposite - that more parallel activity would
> increase the chance of the socket buffer being crammed full before the
> one thread could return to the audit daemon and allow it to start
> processing events.

The audit subsystem can handle a full buffer, that has been stressed a
lot over the years, and while it may slow the system down it will
continue to operate. As a fun exercise, configure the audit subsystem
to audit *every* syscall and then try to shut the system down. It
will complete, but it's ugly and you'll see the audit subsystem
complain a *lot* about dropped records due to queue/backlog overflows.

The issue isn't so much about the queues overflowing inside the
kernel, it's about being able to schedule the audit daemon and/or
kernel thread to service the flood of connection
disconnects/reconnects coming from the reproducer.

> Does the size/number of audit queues perhaps scale
> with the number of CPUs?

The number of queues is static, and their size is determined at
runtime by the configuration. Making per-CPU queues would be a bit of
a mess due to our desire to ensure an accurate ordering of events
relative to a wall clock.

> > I suggest bringing this up with the audit userspace developer if you
> > haven't already. While we can, and should, improve things on the
> > kernel side (e.g. the patch we are discussing), it also sounds like
> > the userspace side has room for improvement as well.
>
> That sounds like a good idea - can you point me to who that is? I was
> under the impression that this was the mailing list for both the kernel
> and userspace sides.

That used to be the case, but unfortunately the audit userspace
developer who manages the original audit mailing list was unwilling to
make the list configuration changes we needed to play nice with
existing Linux kernel mailing list conventions so we needed to move
the kernel development discussions to a separate list.

The old audit mailing list, where the userspace development is still
discussed, can be found here:

* http://www.redhat.com/mailman/listinfo/linux-audit

--
paul-moore.com

Next message: Alex Williamson: "Re: [PATCH 1/2] vfio/mtty: Fix eventfd leak"
Previous message: Dorcas AnonoLitunya: "[PATCH v2 5/5] Staging: sm750fb: Rename ddk750_setModeTiming"
In reply to: Chris Riches: "Re: [RFC PATCH] audit: Send netlink ACK before setting connection in auditd_set"
Next in thread: Chris Riches: "Re: [RFC PATCH] audit: Send netlink ACK before setting connection in auditd_set"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]