Re: [RFC 0/9] Nginx refcount scalability issue with Apparmor enabled and potential solutions

From: Neeraj Upadhyay
Date: Tue Feb 06 2024 - 23:40:43 EST


Gentle ping.

John,

Could you please confirm that:

a. The AppArmor refcount usage described in the RFC is correct?
b. The approach taken to fix the scalability issue is valid/correct?


Thanks
Neeraj

On 1/10/2024 4:41 PM, Neeraj Upadhyay wrote:
> Problem Statement
> =================
>
> Nginx performance testing with Apparmor enabled (with nginx
> running in the unconfined profile), on kernel versions 6.1 and 6.5,
> shows a significant drop in throughput scalability when the Nginx
> workers are scaled to a higher number of CPUs across various
> L3 cache domains.
>
> Below is sample data on the throughput scalability loss, based on
> results from an AMD Zen4 system with 96 cores and 2 SMT threads per
> core, so 192 CPUs overall:
>
>   Config      Cache Domains    apparmor=off       apparmor=on
>                                scaling eff (%)    scaling eff (%)
>   8C16T             1              100%               100%
>   16C32T            2               95%                94%
>   24C48T            3               94%                93%
>   48C96T            6               92%                88%
>   96C192T          12               85%                68%
>
> Looking at the above data, there is a significant drop in
> scaling efficiency when we move to 96 CPUs/192 SMT threads.
>
> The perf tool shows most of the contention coming from the functions below:
>   6.56%  nginx  [kernel.vmlinux]  [k] apparmor_current_getsecid_subj
>   6.22%  nginx  [kernel.vmlinux]  [k] apparmor_file_open
>
> The majority of the CPU cycles are spent on memory contention in the
> atomic_fetch_add and atomic_fetch_sub operations issued by kref_get()
> and kref_put() on the label.
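>
> For context, the label refcount in current mainline is a plain kref
> embedded in struct aa_label; roughly (simplified from
> security/apparmor/include/label.h):
>
> struct aa_label {
>         struct kref count;              /* shared refcount_t/atomic_t */
>         /* ... flags, proxy, vec of profiles, etc. ... */
> };
>
> static inline struct aa_label *aa_get_label(struct aa_label *l)
> {
>         if (l)
>                 kref_get(&(l->count));  /* atomic_fetch_add on a shared line */
>         return l;
> }
>
> static inline void aa_put_label(struct aa_label *l)
> {
>         if (l)
>                 kref_put(&l->count, aa_label_kref); /* atomic_fetch_sub + zero test */
> }
>
> Every CPU executing these hooks hits the same cache line holding
> label->count, which is what the perf profile above shows.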
>
> Commit 2516fde1fa00 ("apparmor: Optimize retrieving current task secid"),
> which went into 6.7, alleviates the issue to an extent, but not completely:
>
>   Config      Cache Domains    apparmor=on        apparmor=on (patched)
>                                scaling eff (%)    scaling eff (%)
>   8C16T             1              100%               100%
>   16C32T            2               97%                93%
>   24C48T            3               94%                92%
>   48C96T            6               88%                88%
>   96C192T          12               65%                79%
>
> This adverse impact gets more pronounced when we move beyond 192 CPUs.
> The memory contention and its impact increase further when label update
> operations are frequent and labels are marked stale more often.
>
>
> Label Refcount Management
> =========================
>
> Apparmor uses label objects (struct aa_label) to manage refcounts for
> the following set of objects:
>
> - Applicable profiles
> - Namespaces (unconfined profile)
> - Other non-profile references
>
> These label references are acquired in various apparmor LSM hooks,
> for operations such as file open, task kill, socket bind, and other
> file/socket/misc operations which use the current task's cred, when
> the label for the current cred has been marked stale. This is done
> to check these operations against the set of operations allowed for
> the task performing them.
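>
> As an illustration, a simplified and hypothetical hook body showing the
> per-invocation get/put pattern (real hooks go through helpers such as
> aa_get_newest_cred_label() and the label critical-section wrappers):
>
> /* Hypothetical sketch -- not the actual AppArmor hook code. */
> static int example_file_open_hook(struct file *file)
> {
>         struct aa_label *label;
>         int error = 0;
>
>         /* every invocation bumps the shared label refcount ... */
>         label = aa_get_newest_cred_label(file->f_cred);
>
>         if (!unconfined(label))
>                 error = example_check_perms(label, file); /* hypothetical check */
>
>         /* ... and drops it again: two atomic RMW ops per call */
>         aa_put_label(label);
>
>         return error;
> }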
>
> Use Percpu refcount for ref management?
> =======================================
>
> The ref put operations (percpu_ref_put()) on a percpu refcount
> in active (percpu) mode do not check whether the ref count has
> dropped to 0. The users of the percpu_ref need to explicitly
> invoke a percpu_ref_kill() operation in the shutdown paths, to
> drop the initial reference. After the percpu_ref_kill() operation,
> the ref switches to atomic mode, and any subsequent
> percpu_ref_put() operation checks for the drop to 0 and invokes
> the release operation on that label.
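>
> For reference, the bare percpu_ref lifecycle looks roughly as below; the
> object and function names are illustrative, the calls are the mainline
> <linux/percpu-refcount.h> API:
>
> struct obj {
>         struct percpu_ref ref;
>         /* payload ... */
> };
>
> static void obj_release(struct percpu_ref *ref)
> {
>         kfree(container_of(ref, struct obj, ref));
> }
>
> static struct obj *obj_create(void)
> {
>         struct obj *o = kzalloc(sizeof(*o), GFP_KERNEL);
>
>         if (!o)
>                 return NULL;
>         /* holds the initial reference; flags==0 starts in percpu mode */
>         if (percpu_ref_init(&o->ref, obj_release, 0, GFP_KERNEL)) {
>                 kfree(o);
>                 return NULL;
>         }
>         return o;
> }
>
> /* Hot path: in percpu (active) mode these never check for a drop to 0. */
> static void obj_use(struct obj *o)
> {
>         percpu_ref_get(&o->ref);
>         /* ... */
>         percpu_ref_put(&o->ref);
> }
>
> /* Shutdown: drops the initial ref and switches to atomic mode; only after
>  * this does percpu_ref_put() detect 0 and invoke obj_release(). */
> static void obj_shutdown(struct obj *o)
> {
>         percpu_ref_kill(&o->ref);
> }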
>
> Labels are marked stale in situations like profile removal and
> profile updates. For a namespace, the unconfined label reference
> is dropped when the namespace is destroyed. These points
> are potential shutdown points for labels. However, killing
> the percpu ref at these points has a few issues:
>
> - The label could still be referenced by tasks which are
> still holding a reference to the now-stale label.
> Killing the label ref while these operations are in progress
> makes all subsequent ref-put operations on the stale label
> atomic, which would still result in memory contention.
> Also, any new reference to the stale label, acquired while the
> refcount is still elevated, will have atomic-op contention.
>
> - The label is marked stale using a non-atomic write operation.
> It is possible that new operations do not observe this flag
> and keep referencing the stale label for quite some time.
>
> - Explicitly tracking the shutdown points might not be maintainable
> at a per-label granularity, as there are various paths where the
> label reference could get dropped, for example before the label has
> gone live - object initialization error paths. Also, tracking
> the shutdown points for labels which reference other labels -
> subprofiles, merged labels - requires careful analysis, and adds a
> heavy burden of ensuring that memory contention is not introduced
> by these ref-kill points.
>
>
> Proposed Solution
> =================
>
> One potential solution to the refcount scalability problem is to
> convert the label refcount to a percpu refcount, and manage
> the initial reference from a kworker context. The kworker
> keeps an extra reference to the label, periodically scans
> the labels, and releases them if their refcount has dropped to 0.
>
> Below is the sequence of operations which shows the refcount
> management with this approach (a code sketch of the per-label
> scan step follows the sequence):
>
> 1. During label initialization, the percpu ref is initialized in
> atomic mode. This is done to handle the cases where the label
> never goes live (->ns isn't assigned), mostly in the
> initialization error paths.
>
> 2. Labels are switched to percpu mode at various points -
> when a label is added to the labelset tree, and when an
> unconfined profile has been assigned a namespace.
>
> 3. As part of the initial prototype, only the in tree labels
> are managed by the kworker. These labels are added to a lockless
> list. The unconfined labels invoke a percpu_ref_kill() operation
> when the namespace is destroyed.
>
> 4. The kworker does a periodic scan of all the labels in the
> llist. It performs the below sequence of operations:
>
> a. Enqueue a dummy node to mark the start of the scan. This dummy
> node is used as the start point of the scan and ensures that
> there is no additional synchronization required with new
> label node additions to the llist. Any new labels will
> be processed in the next run of the kworker.
>
>                          SCAN START PTR
>                                 |
>                                 v
>   +----------+              +------+      +------+      +------+
>   |          |              |      |      |      |      |      |
>   |  head    |------------->| dummy|----->|label |----->| label|--->NULL
>   |          |              | node |      |      |      |      |
>   +----------+              +------+      +------+      +------+
>
>
> New label addition:
>
>                                SCAN START PTR
>                                       |
>                                       v
>   +----------+      +------+      +------+      +------+      +------+
>   |          |      |      |      |      |      |      |      |      |
>   |  head    |----->| label|----->| dummy|----->|label |----->| label|--->NULL
>   |          |      |      |      | node |      |      |      |      |
>   +----------+      +------+      +------+      +------+      +------+
>
> b. Traverse the llist, starting from dummy->next.
> If the node is a dummy node, mark it free.
> If the node is a label node, do the following:
>
> i) Switch the label ref to atomic mode. The ref switch waits
> for the existing percpu_ref_get() and percpu_ref_put()
> operations to complete, by waiting for an RCU grace period.
>
> Once the switch is complete, from this point onwards, any
> percpu_ref_get(), percpu_ref_put() operations use
> atomic operations.
>
> ii) Drop the initial reference, which was taken while adding
> the label node to the llist.
>
> iii) Use a percpu_ref_tryget() increment operation on the
> ref, to see if we dropped the last ref count. If we
> dropped the last count, we remove the node from the llist.
>
> All of these operations are done inside an RCU critical
> section, to avoid racing with the release operation,
> which can potentially trigger as soon as we drop
> the initial ref count.
>
> iv) If we didn't drop the last ref, switch back the counter
> to percpu mode.
>
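> A rough sketch of the per-label part of the scan (step 4.b above), using
> the mainline percpu_ref API; the function name and the list-unlinking
> details are illustrative only:
>
> /*
>  * Returns true if the scan dropped the last reference, i.e. the node
>  * can be unlinked from the scan llist; the release callback frees the
>  * label once the count hits 0.
>  */
> static bool aa_label_try_reclaim(struct aa_label *label)
> {
>         bool reclaimed = false;
>
>         /*
>          * i) Switch to atomic mode; the _sync variant waits for an RCU
>          *    grace period, so in-flight percpu_ref_get()/put() calls on
>          *    other CPUs are folded into the atomic counter.
>          */
>         percpu_ref_switch_to_atomic_sync(&label->count);
>
>         /*
>          * ii)/iii) Drop the initial ref and probe with tryget, inside an
>          * RCU read-side section, so the release callback (which may run
>          * as soon as the initial ref is dropped) cannot race with the
>          * probe.
>          */
>         rcu_read_lock();
>         percpu_ref_put(&label->count);
>         if (!percpu_ref_tryget(&label->count))
>                 reclaimed = true;       /* last ref dropped; unlink node */
>         rcu_read_unlock();
>
>         /*
>          * iv) Still referenced: the tryget ref above stands in for the
>          * initial ref on the next scan pass, and the hot paths go back
>          * to cheap percpu operations.
>          */
>         if (!reclaimed)
>                 percpu_ref_switch_to_percpu(&label->count);
>
>         return reclaimed;
> }
>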
> Using this approach to move the atomic refcount manipulation out of the
> contended paths, a significant scalability improvement is seen in the
> nginx test, and the scaling efficiency is close to the apparmor=off case.
>
>   Config      Cache Domains    apparmor=on (percpuref)
>                                scaling eff (%)
>   8C16T             1              100%
>   16C32T            2               96%
>   24C48T            3               94%
>   48C96T            6               93%
>   96C192T          12               90%
>
> Limitations
> ===========
>
> 1. Switching to percpu refcount increases the memory overhead, as
> percpu memory is allocated for all labels.
>
> 2. Deferring label reclaim could potentially result in memory
> pressure when there is a high frequency of label update operations.
>
> 3. Percpu refcount uses call_rcu_hurry() to complete switch operations.
> This can impact energy efficiency, due to back-to-back hurry
> callbacks. Using a deferrable workqueue partly mitigates this.
> However, deferring the kworker can delay reclaims.
>
> 4. Back-to-back label switches can delay other percpu_ref users, as
> there is a single global switch spinlock used by the percpu refcount
> lib.
>
> 5. A long-running kworker can delay other use cases like system suspend.
> This is mitigated by using a freezable workqueue and limiting node
> scans to a max count per run.
>
> 6. There is a window where the label operates in atomic mode, while
> its counter is being checked for zero. This can potentially result
> in high memory contention during this window, which spans an RCU
> grace period (plus callback execution). For example, when
> scanning the label corresponding to the unconfined profile, all
> applications which use the unconfined profile would be using
> atomic ref increment and decrement operations.
>
> There are a few approaches which were tried to mitigate this issue:
>
> a. At a smaller time interval, check if the scanned label's counter
> has changed since the start of the label scan. If there is a change
> in the count, terminate the switch to atomic mode. Below is the
> approach using rcuwait.
>
> static void aa_label_switch_atomic_confirm(struct percpu_ref *label_ref)
> {
>         WRITE_ONCE(aa_atomic_switch_complete, true);
>         rcuwait_wake_up(&aa_reclaim_rcuwait);
> }
>
> rcuwait_init(&aa_reclaim_rcuwait);
> percpu_ref_switch_to_atomic(&label->count, aa_label_switch_atomic_confirm);
>
> atomic_count = percpu_ref_count_read(&label->count);
> do {
>         rcuwait_wait_event_timeout(&aa_reclaim_rcuwait,
>                                    (switch_complete = READ_ONCE(aa_atomic_switch_complete)),
>                                    TASK_IDLE,
>                                    msecs_to_jiffies(5));
>         if (percpu_ref_count_read(&label->count) != atomic_count)
>                 break;
> } while (!READ_ONCE(switch_complete));
>
> However, this approach does not work, as the percpu refcount lib does
> not allow terminating an ongoing switch operation. Also, the counter
> can return to its original value through a set of get() and put()
> operations before we check the current value.
>
> b. Approaches which notify the reclaim kworker from the ref get and put
> operations can potentially disturb the cache line state between the
> various CPU contexts which are referencing the label, and can
> potentially impact scalability again.
>
> c. Switch the label to an immortal percpu ref, while the scan operates
> on the current counter.
>
> Below is the sequence of operations to do this:
>
> 1. Ensure that both immortal ref and label ref are in percpu mode.
> Reinit the immortal ref in percpu mode.
>
> Swap the percpu and atomic counters of the label refcount and the
> immortal ref:
>
>                       percpu-ref
>                 +-------------------+
>   +-------+     | percpu-ctr-addr1  |
>   | label |---->|-------------------|     +-----------------+
>   +-------+     | data              |---->| Atomic counter1 |
>                 +-------------------+     +-----------------+
>
>                 +-------------------+
>   +-------+     | percpu-ctr-addr2  |
>   |ImmLbl |---->|-------------------|     +-----------------+
>   +-------+     | data              |---->| Atomic counter2 |
>                 +-------------------+     +-----------------+
>
> After the swap:
>
>   label->percpu-ctr-addr  = percpu-ctr-addr2
>   ImmLbl->percpu-ctr-addr = percpu-ctr-addr1
>   label->data->count      = Atomic counter2
>   ImmLbl->data->count     = Atomic counter1
>
>
> 2. Check the counters collected in the immortal ref, by switching it
> to atomic mode.
>
> 3. If the count is 0, do,
> a. Switch immortal counter to percpu again, giving it an
> initial count of 1.
> b. Swap the label and immortal counters again. The immortal
> ref now has the counter values from the new percpu ref get
> and put operations on the label ref, from the point
> when we did the initial swap operation.
> c. Transfer the percpu counts in immortal ref to atomic
> counter of label percpu refcount.
> d. Kill immortal ref, for reinit on next iteration.
> e. Switch label percpu ref to atomic mode.
> f. If the counter is 1, drop the initial ref.
>
> 4. If the count is not 0, re-swap the counters.
> a. Switch immortal counter to percpu again, giving it an
> initial count of 1.
> b. Swap the label and immortal counters again. The immortal
> ref now has the counter values from the new percpu ref get
> and put operations on the label ref, from the point
> when we did the initial swap operation.
> c. Transfer the percpu counts in immortal ref to atomic
> counter of label percpu refcount.
> d. Kill immortal ref, for reinit on next iteration.
>
>
> Using this approach, we ensure that label ref users do not switch
> to atomic mode while there are active references on the label.
> However, this approach requires multiple percpu ref mode switches
> and adds high overhead and complexity to the scanning code.
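>
> For completeness, the scheduling side of the reclaim kworker, with the
> deferrable and freezable mitigations mentioned in limitations 3 and 5,
> could be wired up roughly as below (the names, the scan interval, and
> the scan cap are illustrative only):
>
> #define AA_LABEL_RECLAIM_MAX_SCAN       128     /* cap per run (limitation 5) */
> #define AA_LABEL_RECLAIM_INTERVAL       msecs_to_jiffies(5000)
>
> static struct workqueue_struct *aa_reclaim_wq;
> static LLIST_HEAD(aa_label_reclaim_head);
>
> static void aa_label_reclaim_work_fn(struct work_struct *work);
> /* deferrable: an idle CPU is not woken just to start a scan (limitation 3) */
> static DECLARE_DEFERRABLE_WORK(aa_label_reclaim_work, aa_label_reclaim_work_fn);
>
> static void aa_label_reclaim_work_fn(struct work_struct *work)
> {
>         /*
>          * Enqueue the dummy node, then walk at most
>          * AA_LABEL_RECLAIM_MAX_SCAN label nodes from
>          * aa_label_reclaim_head, applying the per-label step sketched
>          * earlier to each of them.
>          */
>
>         queue_delayed_work(aa_reclaim_wq, &aa_label_reclaim_work,
>                            AA_LABEL_RECLAIM_INTERVAL);
> }
>
> static int __init aa_label_reclaim_init(void)
> {
>         /* WQ_FREEZABLE: the scan is frozen across suspend (limitation 5) */
>         aa_reclaim_wq = alloc_workqueue("aa_label_reclaim", WQ_FREEZABLE, 0);
>         if (!aa_reclaim_wq)
>                 return -ENOMEM;
>
>         queue_delayed_work(aa_reclaim_wq, &aa_label_reclaim_work,
>                            AA_LABEL_RECLAIM_INTERVAL);
>         return 0;
> }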
>
> Extended/Future Work
> ====================
>
> 1. Look for ways to fix the limitations, as described in the "Limitations"
> section.
>
> 2. Generalize the approach to percpu rcuref, which is used for contexts
> where the release path uses an RCU grace period for the release
> operation. Patch 7 creates an initial prototype for this.
>
> 3. Explore hazard pointers for scalable refcounting of labels.
>
> Highly appreciate any feedback/suggestions on the design approach.
>
> The patches in this patchset introduce the following changes:
>
> 1. Documentation of Apparmor Refcount management.
>
> 2. Switch labels to percpu refcount in atomic mode.
>
> Use percpu refcount for apparmor labels. Initial patch to init
> the percpu ref in atomic mode, to evaluate the potential
> impact of percpu ref on top of the kref-based implementation.
>
> 3. Switch unconfined namespaces refcount to percpu mode.
>
> Switch unconfined ns labels to percpu mode, and kill the
> initial refcount from namespace destroy path.
>
> 4. Add infrastructure to reclaim percpu labels.
>
> Add a label reclaim infrastructure for labels which are
> in percpu mode, for managing their initial refcount.
>
> 5. Switch intree labels to percpu mode.
>
> Use the label reclaim infrastructure to manage in-tree labels.
>
> 6. Initial prototype for optimizing ref switch.
>
> Prototype for reducing the time window when a label
> scan switches the label ref to atomic mode.
>
> 7. percpu-rcuref: Add basic infrastructure.
>
> Prototype for Percpu refcounts for objects, which protect
> their object reclaims using RCU grace period.
>
> 8. Switch labels to percpu rcurefcount in unmanaged mode.
>
> Use percpu rcuref for labels. Start with unmanaged/atomic
> mode.
>
> 9. Switch unconfined and in tree labels to managed ref mode.
>
> Use percpu mode with manager worker for unconfined and intree
> labels.
>
>
> ------------------------------------------------------------------------
>
> b/Documentation/admin-guide/LSM/ApparmorRefcount.rst | 351 ++++++++++++++++++++++++++++++++++++++++++++++++++
> b/Documentation/admin-guide/LSM/index.rst | 1
> b/Documentation/admin-guide/kernel-parameters.txt | 8 +
> b/include/linux/percpu-rcurefcount.h | 115 ++++++++++++++++
> b/include/linux/percpu-refcount.h | 2
> b/lib/Makefile | 2
> b/lib/percpu-rcurefcount.c | 336 +++++++++++++++++++++++++++++++++++++++++++++++
> b/lib/percpu-refcount.c | 93 +++++++++++++
> b/security/apparmor/include/label.h | 16 +-
> b/security/apparmor/include/policy.h | 8 -
> b/security/apparmor/include/policy_ns.h | 24 +++
> b/security/apparmor/label.c | 11 +
> b/security/apparmor/lsm.c | 145 ++++++++++++++++++++
> b/security/apparmor/policy_ns.c | 6
> include/linux/percpu-refcount.h | 2
> lib/percpu-refcount.c | 93 -------------
> security/apparmor/include/label.h | 17 +-
> security/apparmor/include/policy.h | 56 +++----
> security/apparmor/include/policy_ns.h | 24 ---
> security/apparmor/label.c | 11 -
> security/apparmor/lsm.c | 325 ++++++++++++----------------------------------
> security/apparmor/policy_ns.c | 8 -
> 22 files changed, 1237 insertions(+), 417 deletions(-)
>
> base-commit: ab27740f7665