Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching
From: Uladzislau Rezki
Date: Tue Dec 10 2019 - 04:54:03 EST
On Wed, Sep 18, 2019 at 11:58:11AM +0200, Uladzislau Rezki wrote:
> > Recently a discussion about stability and performance of a system
> > involving a high rate of kfree_rcu() calls surfaced on the list [1]
> > which led to another discussion how to prepare for this situation.
> >
> > This patch adds basic batching support for kfree_rcu(). It is "basic"
> > because we do none of the slab management, dynamic allocation, code
> > moving or any of the other things, some of which previous attempts did
> > [2]. These fancier improvements can be follow-up patches and there are
> > different ideas being discussed in those regards. This is an effort to
> > start simple, and build up from there. In the future, an extension to
> > use kfree_bulk and possibly per-slab batching could be done to further
> > improve performance due to cache-locality and slab-specific bulk free
> > optimizations. By using an array of pointers, the worker thread
> > processing the work would need to read lesser data since it does not
> > need to deal with large rcu_head(s) any longer.
> >
According to https://lkml.org/lkml/2017/12/19/706 there was an attempt
to make use of kfree_bulk() interface. I have done some tests based on
your patch and enhanced kfree_bulk() logic. Basically storing pointers
in an array with a specific size makes sense to me and seems to others
as well. I mean in comparison with "pointer chasing" way, when there is
probably a cache misses each time the access is done to next element:
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1fe0418a5901..4f68662c1568 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2595,6 +2595,13 @@ EXPORT_SYMBOL_GPL(call_rcu);
/* Maximum number of jiffies to wait before draining a batch. */
#define KFREE_DRAIN_JIFFIES (HZ / 50)
+#define KFREE_BULK_MAX_SIZE 64
+
+struct kfree_rcu_bulk_data {
+ int nr_records;
+ void *records[KFREE_BULK_MAX_SIZE];
+ struct kfree_rcu_bulk_data *next;
+};
/*
* Maximum number of kfree(s) to batch, if this limit is hit then the batch of
@@ -2607,15 +2614,24 @@ struct kfree_rcu_cpu {
struct rcu_work rcu_work;
/* The list of objects being queued in a batch but are not yet
- * scheduled to be freed.
+ * scheduled to be freed. For emergency path only.
*/
struct rcu_head *head;
/* The list of objects that have now left ->head and are queued for
- * freeing after a grace period.
+ * freeing after a grace period. For emergency path only.
*/
struct rcu_head *head_free;
+ /*
+ * It is a block list that keeps pointers in the array of specific
+ * size which are freed by the kfree_bulk() logic. Intends to improve
+ * drain throughput.
+ */
+ struct kfree_rcu_bulk_data *bhead;
+ struct kfree_rcu_bulk_data *bhead_free;
+ struct kfree_rcu_bulk_data *bcached;
+
/* Protect concurrent access to this structure. */
spinlock_t lock;
@@ -2637,23 +2653,39 @@ static void kfree_rcu_work(struct work_struct *work)
{
unsigned long flags;
struct rcu_head *head, *next;
+ struct kfree_rcu_bulk_data *bhead, *bnext;
struct kfree_rcu_cpu *krcp = container_of(to_rcu_work(work),
struct kfree_rcu_cpu, rcu_work);
spin_lock_irqsave(&krcp->lock, flags);
head = krcp->head_free;
krcp->head_free = NULL;
+ bhead = krcp->bhead_free;
+ krcp->bhead_free = NULL;
spin_unlock_irqrestore(&krcp->lock, flags);
/*
* The head is detached and not referenced from anywhere, so lockless
* access is Ok.
*/
+ for (; bhead; bhead = bnext) {
+ bnext = bhead->next;
+ kfree_bulk(bhead->nr_records, bhead->records);
+
+ if (cmpxchg(&krcp->bcached, NULL, bhead))
+ kfree(bhead);
+
+ cond_resched_tasks_rcu_qs();
+ }
+
+ /*
+ * Emergency case only. It can happen under low
+ * memory condition when kmalloc gets failed, so
+ * the "bulk" path can not be temporary maintained.
+ */
for (; head; head = next) {
next = head->next;
- /* Could be possible to optimize with kfree_bulk in future */
__rcu_reclaim(rcu_state.name, head);
- cond_resched_tasks_rcu_qs();
}
}
@@ -2671,11 +2703,15 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
* another one, just refuse the optimization and it will be retried
* again in KFREE_DRAIN_JIFFIES time.
*/
- if (krcp->head_free)
+ if (krcp->bhead_free || krcp->head_free)
return false;
krcp->head_free = krcp->head;
krcp->head = NULL;
+
+ krcp->bhead_free = krcp->bhead;
+ krcp->bhead = NULL;
+
INIT_RCU_WORK(&krcp->rcu_work, kfree_rcu_work);
queue_rcu_work(system_wq, &krcp->rcu_work);
@@ -2747,6 +2783,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
{
unsigned long flags;
struct kfree_rcu_cpu *krcp;
+ struct kfree_rcu_bulk_data *bnode;
/* kfree_call_rcu() batching requires timers to be up. If the scheduler
* is not yet up, just skip batching and do the non-batched version.
@@ -2754,16 +2791,35 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING)
return kfree_call_rcu_nobatch(head, func);
- head->func = func;
-
local_irq_save(flags); /* For safely calling this_cpu_ptr(). */
krcp = this_cpu_ptr(&krc);
spin_lock(&krcp->lock);
+ if (!krcp->bhead ||
+ krcp->bhead->nr_records == KFREE_BULK_MAX_SIZE) {
+ /* Need a new block. */
+ if (!(bnode = xchg(&krcp->bcached, NULL)))
+ bnode = kmalloc(sizeof(struct kfree_rcu_bulk_data),
+ GFP_ATOMIC | __GFP_NOWARN);
+
+ /* If gets failed, maintain the list instead. */
+ if (unlikely(!bnode)) {
+ head->func = func;
+ head->next = krcp->head;
+ krcp->head = head;
+ goto check_and_schedule;
+ }
+
+ bnode->nr_records = 0;
+ bnode->next = krcp->bhead;
+ krcp->bhead = bnode;
+ }
+
/* Queue the kfree but don't yet schedule the batch. */
- head->next = krcp->head;
- krcp->head = head;
+ krcp->bhead->records[krcp->bhead->nr_records++] =
+ (void *) head - (unsigned long) func;
+check_and_schedule:
/* Schedule monitor for timely drain after KFREE_DRAIN_JIFFIES. */
if (!xchg(&krcp->monitor_todo, true))
schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
See below some test results with/without this patch:
# HiKey 960 8xCPUs
rcuperf.ko kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1
[ 159.017771] Total time taken by all kfree'ers: 92783584881 ns, loops: 200000, batches: 5117
[ 126.862573] Total time taken by all kfree'ers: 70935580718 ns, loops: 200000, batches: 3953
Running the "rcuperf" shows approximately ~23% better throughput in case of using
"bulk" interface, so we have 92783584881 vs 70935580718 as total time. The "drain logic"
or its RCU callback does the work faster that leads to better throughput.
I can upload the RFC/PATCH of that change providing the test details and so on.
Any thoughts about it?
Thank you in advance!
--
Vlad Rezki