Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching

From: Paul E. McKenney
Date: Tue Aug 13 2019 - 15:11:30 EST

Next message: Matthias Kaehlcke: "[PATCH v6 0/4] net: phy: Add support for DT configuration of PHY LEDs and use it for RTL8211E"
Previous message: Pierre-Louis Bossart: "Re: [PATCH v2 1/5] soundwire: Add compute_params callback"
In reply to: Joel Fernandes (Google): "[PATCH v3 2/2] rcuperf: Add kfree_rcu performance Tests"
Next in thread: Joel Fernandes: "Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Aug 13, 2019 at 01:00:45PM -0400, Joel Fernandes (Google) wrote:
> Recently a discussion about stability and performance of a system
> involving a high rate of kfree_rcu() calls surfaced on the list [1]
> which led to another discussion how to prepare for this situation.

Looks much improved! More questions and comments inline.

Thanx, Paul

> This patch adds basic batching support for kfree_rcu(). It is "basic"
> because we do none of the slab management, dynamic allocation, code
> moving or any of the other things, some of which previous attempts did
> [2]. These fancier improvements can be follow-up patches and there are
> different ideas being discussed in those regards.
>
> Torture tests follow in the next patch and show improvements of around
> 400% in reduction of number of grace periods on a 16 CPU system. More

s/400% in reduction/a 5x reduction/

Oddly enough, a reduction of more than 100% would require a negative
number of grace periods. RCU being what it is, although this seems
unlikely, I hesitate to rule it out. ;-)

> details and test data are in that patch.
>
> This is an effort to start simple, and build up from there. In the
> future, an extension to use kfree_bulk and possibly per-slab batching
> could be done to further improve performance due to cache-locality and
> slab-specific bulk free optimizations. By using an array of pointers,
> the worker thread processing the work would need to read lesser data
> since it does not need to deal with large rcu_head(s) any longer.

This should be combined with the second paragraph -- they both discuss
possible follow-on work.

> There is an implication with rcu_barrier() with this patch. Since the
> kfree_rcu() calls can be batched, and may not be handed yet to the RCU
> machinery in fact, the monitor may not have even run yet to do the
> queue_rcu_work(), there seems no easy way of implementing rcu_barrier()
> to wait for those kfree_rcu()s that are already made. So this means a
> kfree_rcu() followed by an rcu_barrier() does not imply that memory will
> be freed once rcu_barrier() returns.

The usual approach (should kfree_rcu_barrier() in fact be needed) is to
record the address of the most recent block being kfree_rcu()'d on
each CPU, count the number of recorded addresses, and have the
kfree() side check the address, and do the usual atomic_dec_and_test()
and wakeup dance. This gets a bit more crazy should the kfree_rcu()
requests be grouped per slab, of course! So yes, good to avoid in
the meantime.

> Another implication is higher active memory usage (although not
> run-away..) until the kfree_rcu() flooding ends, in comparison to
> without batching. More details about this are in the second patch which
> adds an rcuperf test.

But this is adjustable, at least via patching at build time, via
KFREE_DRAIN_JIFFIES, right?

> Finally, in the near future we will get rid of kfree_rcu special casing
> within RCU such as in rcu_do_batch and switch everything to just
> batching. Currently we don't do that since timer subsystem is not yet up
> and we cannot schedule the kfree_rcu() monitor as the timer subsystem's
> lock are not initialized. That would also mean getting rid of
> kfree_call_rcu_nobatch() entirely.

Please consistently put "()" after function names.

> [1] http://lore.kernel.org/lkml/20190723035725-mutt-send-email-mst@xxxxxxxxxx
> [2] https://lkml.org/lkml/2017/12/19/824
>
> Cc: Rao Shoaib <rao.shoaib@xxxxxxxxxx>
> Cc: max.byungchul.park@xxxxxxxxx
> Cc: byungchul.park@xxxxxxx
> Cc: kernel-team@xxxxxxxxxxx
> Cc: kernel-team@xxxxxxx
> Co-developed-by: Byungchul Park <byungchul.park@xxxxxxx>
> Signed-off-by: Byungchul Park <byungchul.park@xxxxxxx>
> Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
>
> ---
> v2->v3: Just some code comment changes thanks to Byungchul.
>
> RFCv1->PATCH v2: Removed limits on the ->head list, just let it grow.
> Dropped KFREE_MAX_JIFFIES to HZ/50 from HZ/20 to reduce OOM occurrence.
> Removed sleeps in rcuperf test, just using cond_resched()in loop.
> Better code comments ;)
>
> .../admin-guide/kernel-parameters.txt | 17 ++
> include/linux/rcutiny.h | 5 +
> include/linux/rcutree.h | 1 +
> kernel/rcu/tree.c | 204 +++++++++++++++++-
> 4 files changed, 221 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 7ccd158b3894..a9156ca5de24 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3895,6 +3895,23 @@
> test until boot completes in order to avoid
> interference.
>
> + rcuperf.kfree_rcu_test= [KNL]
> + Set to measure performance of kfree_rcu() flooding.
> +
> + rcuperf.kfree_nthreads= [KNL]
> + The number of threads running loops of kfree_rcu().
> +
> + rcuperf.kfree_alloc_num= [KNL]
> + Number of allocations and frees done in an iteration.
> +
> + rcuperf.kfree_loops= [KNL]
> + Number of loops doing rcuperf.kfree_alloc_num number
> + of allocations and frees.
> +
> + rcuperf.kfree_no_batch= [KNL]
> + Use the non-batching (slower) version of kfree_rcu.
> + This is useful for comparing with the batched version.
> +
> rcuperf.nreaders= [KNL]
> Set number of RCU readers. The value -1 selects
> N, where N is the number of CPUs. A value

This change to kernel-parameters.txt really needs to be in the
rcuperf patch rather than this one.

> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> index 8e727f57d814..383f2481750f 100644
> --- a/include/linux/rcutiny.h
> +++ b/include/linux/rcutiny.h
> @@ -39,6 +39,11 @@ static inline void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> call_rcu(head, func);
> }
>
> +static inline void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func)
> +{
> + call_rcu(head, func);
> +}
> +
> void rcu_qs(void);
>
> static inline void rcu_softirq_qs(void)
> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> index 735601ac27d3..7e38b39ec634 100644
> --- a/include/linux/rcutree.h
> +++ b/include/linux/rcutree.h
> @@ -34,6 +34,7 @@ static inline void rcu_virt_note_context_switch(int cpu)
>
> void synchronize_rcu_expedited(void);
> void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func);
> +void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func);
>
> void rcu_barrier(void);
> bool rcu_eqs_special_set(int cpu);
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index a14e5fbbea46..3e62bbb0e1fa 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2593,17 +2593,195 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
> }
> EXPORT_SYMBOL_GPL(call_rcu);
>
> +
> +/* Maximum number of jiffies to wait before draining a batch. */
> +#define KFREE_DRAIN_JIFFIES (HZ / 50)
> +
> /*
> - * Queue an RCU callback for lazy invocation after a grace period.
> - * This will likely be later named something like "call_rcu_lazy()",
> - * but this change will require some way of tagging the lazy RCU
> - * callbacks in the list of pending callbacks. Until then, this
> - * function may only be called from __kfree_rcu().
> + * Maximum number of kfree(s) to batch, if this limit is hit then the batch of
> + * kfree(s) is queued for freeing after a grace period, right away.
> */
> -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> +struct kfree_rcu_cpu {
> + /* The rcu_work node for queuing work with queue_rcu_work(). The work
> + * is done after a grace period.
> + */
> + struct rcu_work rcu_work;
> +
> + /* The list of objects being queued in a batch but are not yet
> + * scheduled to be freed.
> + */
> + struct rcu_head *head;
> +
> + /* The list of objects that have now left ->head and are queued for
> + * freeing after a grace period.
> + */
> + struct rcu_head *head_free;

So this is not yet the one that does multiple batches concurrently
awaiting grace periods, correct? Or am I missing something subtle?

If it is not doing the multiple-batch optimization, what does the
400% in the commit log refer to?

> + /* Protect concurrent access to this structure. */
> + spinlock_t lock;
> +
> + /* The delayed work that flushes ->head to ->head_free incase ->head
> + * within KFREE_DRAIN_JIFFIES. In case flushing cannot be done if RCU
> + * is busy, ->head just continues to grow and we retry flushing later.
> + */
> + struct delayed_work monitor_work;
> + bool monitor_todo; /* Is a delayed work pending execution? */
> +};
> +
> +static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc);
> +
> +/*
> + * This function is invoked in workqueue context after a grace period.
> + * It frees all the objects queued on ->head_free.
> + */
> +static void kfree_rcu_work(struct work_struct *work)
> +{
> + unsigned long flags;
> + struct rcu_head *head, *next;
> + struct kfree_rcu_cpu *krcp = container_of(to_rcu_work(work),
> + struct kfree_rcu_cpu, rcu_work);
> +
> + spin_lock_irqsave(&krcp->lock, flags);
> + head = krcp->head_free;
> + krcp->head_free = NULL;
> + spin_unlock_irqrestore(&krcp->lock, flags);
> +
> + /*
> + * The head is detached and not referenced from anywhere, so lockless
> + * access is Ok.
> + */
> + for (; head; head = next) {
> + next = head->next;
> + head->next = NULL;

Why do we need to NULL ->next? Could produce an expensive cache miss,
and I don't see how it helps anything. What am I missing here?

> + /* Could be possible to optimize with kfree_bulk in future */
> + __rcu_reclaim(rcu_state.name, head);
> + cond_resched_tasks_rcu_qs();
> + }
> +}
> +
> +/*
> + * Schedule the kfree batch RCU work to run in workqueue context after a GP.
> + *
> + * This function is invoked by kfree_rcu_monitor() when the KFREE_DRAIN_JIFFIES
> + * timeout has been reached.
> + */
> +static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
> +{
> + lockdep_assert_held(&krcp->lock);
> +
> + /* If a previous RCU batch work is already in progress, we cannot queue
> + * another one, just refuse the optimization and it will be retried
> + * again in KFREE_DRAIN_JIFFIES time.
> + */
> + if (krcp->head_free)
> + return false;
> +
> + krcp->head_free = krcp->head;
> + krcp->head = NULL;
> + INIT_RCU_WORK(&krcp->rcu_work, kfree_rcu_work);
> + queue_rcu_work(system_wq, &krcp->rcu_work);
> +
> + return true;
> +}
> +
> +static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp,
> + unsigned long flags)
> +{
> + /* Flush ->head to ->head_free, all objects on ->head_free will be
> + * kfree'd after a grace period.
> + */
> + if (queue_kfree_rcu_work(krcp)) {
> + /* Success! Our job is done here. */
> + spin_unlock_irqrestore(&krcp->lock, flags);
> + return;
> + }
> +
> + /* Previous batch did not get free yet, let us try again soon. */
> + if (krcp->monitor_todo == false) {
> + schedule_delayed_work_on(smp_processor_id(),
> + &krcp->monitor_work, KFREE_DRAIN_JIFFIES);

Given that we are using a lock, do we need to restrict to the current
CPU? In any case, we do need to handle the case where the current
CPU has gone offline before we get around to sending a batch off to
wait for a grace period, correct?

> + krcp->monitor_todo = true;
> + }
> + spin_unlock_irqrestore(&krcp->lock, flags);
> +}
> +
> +/*
> + * This function is invoked after the KFREE_DRAIN_JIFFIES timeout has elapsed,
> + * so it drains the specified kfree_rcu_cpu structure's ->head list.
> + */
> +static void kfree_rcu_monitor(struct work_struct *work)
> +{
> + bool todo;
> + unsigned long flags;
> + struct kfree_rcu_cpu *krcp = container_of(work, struct kfree_rcu_cpu,
> + monitor_work.work);
> +
> + spin_lock_irqsave(&krcp->lock, flags);
> + todo = krcp->monitor_todo;
> + krcp->monitor_todo = false;
> + if (todo)
> + kfree_rcu_drain_unlock(krcp, flags);
> + else
> + spin_unlock_irqrestore(&krcp->lock, flags);
> +}
> +
> +/*
> + * This version of kfree_call_rcu does not do batching of kfree_rcu() requests.
> + * Used only by rcuperf torture test for comparison with kfree_rcu_batch().
> + */
> +void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func)
> {
> __call_rcu(head, func, -1, 1);
> }
> +EXPORT_SYMBOL_GPL(kfree_call_rcu_nobatch);
> +
> +/*
> + * Queue a request for lazy invocation of kfree() after a grace period.
> + *
> + * Each kfree_call_rcu() request is added to a batch. The batch will be drained
> + * every KFREE_DRAIN_JIFFIES number of jiffies. All the objects in the batch
> + * will be kfree'd in workqueue context. This allows us to:
> + *
> + * 1. Batch requests together to reduce the number of grace periods during
> + * heavy kfree_rcu() load.
> + *
> + * 2. In the future, makes it possible to use kfree_bulk() on a large number of
> + * kfree_rcu() requests thus reducing the per-object overhead of kfree() and
> + * also reducing cache misses.
> + */
> +void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> +{
> + unsigned long flags;
> + struct kfree_rcu_cpu *krcp;
> + bool monitor_todo;
> +
> + /* kfree_call_rcu() batching requires timers to be up. If the scheduler
> + * is not yet up, just skip batching and do non-batched kfree_call_rcu().
> + */
> + if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING)
> + return kfree_call_rcu_nobatch(head, func);
> +
> + local_irq_save(flags);
> + krcp = this_cpu_ptr(&krc);
> +
> + spin_lock(&krcp->lock);
> +
> + /* Queue the kfree but don't yet schedule the batch. */
> + head->func = func;
> + head->next = krcp->head;
> + krcp->head = head;
> +
> + /* Schedule monitor for timely drain after KFREE_DRAIN_JIFFIES. */
> + monitor_todo = krcp->monitor_todo;
> + krcp->monitor_todo = true;
> + spin_unlock(&krcp->lock);
> +
> + if (!monitor_todo) {
> + schedule_delayed_work_on(smp_processor_id(),
> + &krcp->monitor_work, KFREE_DRAIN_JIFFIES);
> + }
> + local_irq_restore(flags);
> +}
> EXPORT_SYMBOL_GPL(kfree_call_rcu);
>
> /*
> @@ -3455,10 +3633,24 @@ static void __init rcu_dump_rcu_node_tree(void)
> struct workqueue_struct *rcu_gp_wq;
> struct workqueue_struct *rcu_par_gp_wq;
>
> +static void __init kfree_rcu_batch_init(void)
> +{
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> +
> + spin_lock_init(&krcp->lock);
> + INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> + }
> +}
> +
> void __init rcu_init(void)
> {
> int cpu;
>
> + kfree_rcu_batch_init();
> +
> rcu_early_boot_tests();
>
> rcu_bootup_announce();
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>

Next message: Matthias Kaehlcke: "[PATCH v6 0/4] net: phy: Add support for DT configuration of PHY LEDs and use it for RTL8211E"
Previous message: Pierre-Louis Bossart: "Re: [PATCH v2 1/5] soundwire: Add compute_params callback"
In reply to: Joel Fernandes (Google): "[PATCH v3 2/2] rcuperf: Add kfree_rcu performance Tests"
Next in thread: Joel Fernandes: "Re: [PATCH v3 1/2] rcu/tree: Add basic support for kfree_rcu batching"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]