Re: [PATCH RFC v1 1/2] rcu/tree: Add basic support for kfree_rcu batching

From: Byungchul Park
Date: Thu Aug 08 2019 - 06:27:41 EST


On Wed, Aug 07, 2019 at 05:45:04AM -0400, Joel Fernandes wrote:
> On Tue, Aug 06, 2019 at 04:56:31PM -0700, Paul E. McKenney wrote:

[snip]

> > On Tue, Aug 06, 2019 at 05:20:40PM -0400, Joel Fernandes (Google) wrote:
> > Of course, I am hoping that a later patch uses an array of pointers built
> > at kfree_rcu() time, similar to Rao's patch (with or without kfree_bulk)
> > in order to reduce per-object cache-miss overhead. This would make it
> > easier for callback invocation to keep up with multi-CPU kfree_rcu()
> > floods.
>
> I think Byungchul tried an experiment with array of pointers and wasn't
> immediately able to see a benefit. Perhaps his patch needs a bit more polish
> or another test-case needed to show benefit due to cache-misses, and the perf
> tool could be used to show if cache misses were reduced. For this initial
> pass, we decided to keep it without the array optimization.

I'm still seeing no improvement with kfree_bulk().

I had expected to see an improvement with kfree_bulk() because:

1. As you guys said, the number of cache misses will be reduced.
2. We can save (N - 1) irq-disable instructions when freeing N objects.
3. As Joel said, the saving/restoring of CPU state that kfree() does
internally is not required.
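
To make the arithmetic in point 2 concrete, here is a toy userspace
model, not kernel code. It assumes that every individual free pays for
its own irq-disabled section and that a bulk free pays for exactly one;
whether a given slab allocator really behaves that way is not something
this sketch establishes. All names (free_stats, model_free_*) are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counters for the toy model. Nothing here touches real interrupts. */
struct free_stats {
	size_t irq_sections;	/* simulated irq-disable/enable pairs */
	size_t objects_freed;	/* objects released so far */
};

/* Models one individual free: assumed to cost one irq-disabled section. */
static void model_free_one(struct free_stats *st)
{
	st->irq_sections++;
	st->objects_freed++;
}

/* Models freeing n objects one at a time. */
static void model_free_loop(struct free_stats *st, size_t n)
{
	for (size_t i = 0; i < n; i++)
		model_free_one(st);
}

/* Models a bulk free: assumed to cost one section for the whole batch. */
static void model_free_bulk(struct free_stats *st, size_t n)
{
	if (n == 0)
		return;
	st->irq_sections++;
	st->objects_freed += n;
}

/* Number of irq-disabled sections the bulk path saves for n objects. */
static size_t model_sections_saved(size_t n)
{
	struct free_stats loop = { 0, 0 };
	struct free_stats bulk = { 0, 0 };

	model_free_loop(&loop, n);
	model_free_bulk(&bulk, n);
	/* Both paths must release the same number of objects. */
	assert(loop.objects_freed == bulk.objects_freed);
	return loop.irq_sections - bulk.irq_sections;
}
```

Under these assumptions model_sections_saved(n) is n - 1 for any
n >= 1, which is the (N - 1) saving mentioned above.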

But even with the following patch applied, the result was the same as
with the batching-only test. We might need to get kmalloc objects from
random addresses to maximize the benefit of kfree_bulk(), which would
also be closer to real-world usage.
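
One way a test could approximate "random addresses" is to shuffle the
pointer array before freeing, so that the free order no longer follows
the allocation order. The following is a userspace sketch with
malloc()/free() standing in for the kernel allocator; it is not part of
the patch below, and the helper names (prng_next, shuffle_ptrs,
alloc_shuffle_free) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Small deterministic xorshift32 generator so runs are repeatable. */
static uint32_t prng_next(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;
	return x;
}

/*
 * Fisher-Yates shuffle of a pointer array. The modulo introduces a
 * slight bias, which is acceptable for a benchmark-style sketch.
 */
static void shuffle_ptrs(void **ptrs, size_t n, uint32_t seed)
{
	uint32_t state = seed ? seed : 1u;	/* xorshift must not start at 0 */

	for (size_t i = n; i > 1; i--) {
		size_t j = prng_next(&state) % i;
		void *tmp = ptrs[i - 1];

		ptrs[i - 1] = ptrs[j];
		ptrs[j] = tmp;
	}
}

/* Allocate n objects, shuffle the pointers, then free in shuffled order. */
static int alloc_shuffle_free(size_t n, size_t obj_size, uint32_t seed)
{
	void **ptrs = calloc(n ? n : 1, sizeof(*ptrs));
	int ret = 0;

	if (!ptrs)
		return -1;
	for (size_t i = 0; i < n; i++) {
		ptrs[i] = malloc(obj_size);
		if (!ptrs[i])
			ret = -1;	/* remember the failure, free the rest below */
	}
	shuffle_ptrs(ptrs, n, seed);
	for (size_t i = 0; i < n; i++)
		free(ptrs[i]);		/* free(NULL) is a no-op */
	free(ptrs);
	return ret;
}
```

Shuffling only changes the order of frees; it does not spread the
objects over more slabs than the allocation pattern already did.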

And the second and third reasons don't seem to help as much as I
expected.

Do you have any ideas? What do you think?

Thanks,
Byungchul

-----8<-----
diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index 988e1ae..6f2ab06 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -651,10 +651,10 @@ struct kfree_obj {
 				return -ENOMEM;
 		}

-		for (i = 0; i < kfree_alloc_num; i++) {
-			if (!kfree_no_batch) {
-				kfree_rcu(alloc_ptrs[i], rh);
-			} else {
+		if (!kfree_no_batch) {
+			kfree_bulk(kfree_alloc_num, (void **)alloc_ptrs);
+		} else {
+			for (i = 0; i < kfree_alloc_num; i++) {
 				rcu_callback_t cb;

 				cb = (rcu_callback_t)(unsigned long)offsetof(struct kfree_obj, rh);