Re: [PATCH 1/5] rcu/rcuperf: Add kfree_rcu() performance Tests

From: Joel Fernandes
Date: Thu Aug 29 2019 - 16:56:44 EST


On Wed, Aug 28, 2019 at 02:12:26PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 27, 2019 at 03:01:55PM -0400, Joel Fernandes (Google) wrote:
> > This test runs kfree_rcu() in a loop to measure performance of the new
> > kfree_rcu() batching functionality.
> >
> > The following table shows results when booting with arguments:
> > rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000 rcuperf.kfree_rcu_test=1
> >
> > In addition, rcuperf.kfree_no_batch is used to toggle the batching of
> > kfree_rcu()s for a test run.
> >
> > patch applied     GPs     time (seconds)
> > yes              1732     14.5
> > no               9133     11.5
>
> This is really "rcuperf.kfree_no_batch" rather than "patch applied", right?
> (Yes, we did discuss this last time around, but this table combined with
> the prior paragraph is still ambiguous.) Please make it unambiguous.
> One way to do that is as follows:
>
> ------------------------------------------------------------------------
>
> The following table shows results when booting with arguments:
> rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000 rcuperf.kfree_rcu_test=1 rcuperf.kfree_no_batch=X
>
> rcuperf.kfree_no_batch=X   # Grace Periods   Test Duration (s)
> X=1 (old behavior)                   9133               11.5
> X=0 (new behavior)                   1732               14.5

Yes, you are right, will fix. The reason I changed it to 'patch applied' is
that the last patch in the series removes kfree_no_batch, but I will reword
the table as you suggest. Thanks!

> > On a 16 CPU system with the above boot parameters, we see that the total
> > number of grace periods that elapse during the test drops from 9133 when
> > not batching to 1732 when batching (a 5X improvement). The kfree_rcu()
> > flood itself slows down a bit when batching, though, as shown.
>
> This last sentence would be more clear as something like: "However,
> use of batching increases the duration of the kfree_rcu()-flood test."
>
> > Note that the active memory consumption during the kfree_rcu() flood
> > does increase to around 200-250MB due to the batching (from around 50MB
> > without batching). However, this memory consumption is relatively
> > constant. In other words, the system is able to keep up with the
> > kfree_rcu() load. The memory consumption comes down considerably if
> > KFREE_DRAIN_JIFFIES is increased from HZ/50 to HZ/80.
>
> That would be a decrease rather than an increase in KFREE_DRAIN_JIFFIES,
> correct?
>
> This would also be a good place to mention that a later patch will
> decrease consumption, but that is strictly optional. However, you did
> introduce the topic of changing KFREE_DRAIN_JIFFIES, so if a later patch
> changes this value, this would be an excellent place to mention this.

Fixed.
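
Just to spell out the arithmetic behind your point (illustrative only,
assuming HZ=1000; these names are not from the patch itself):

    #define KFREE_DRAIN_JIFFIES_OLD  (HZ / 50)  /* 20 jiffies, ~20 ms between drains */
    #define KFREE_DRAIN_JIFFIES_NEW  (HZ / 80)  /* 12 jiffies, ~12 ms between drains */

So going from HZ/50 to HZ/80 is indeed a decrease in KFREE_DRAIN_JIFFIES: the
drain is scheduled sooner, fewer objects accumulate per batch, and the flood's
memory footprint stays lower.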

[snip]
> > +/*
> > + * kfree_rcu() performance tests: Start a kfree_rcu() loop on all CPUs for number
> > + * of iterations and measure total time and number of GP for all iterations to complete.
> > + */
> > +
> > +torture_param(int, kfree_nthreads, -1, "Number of threads running loops of kfree_rcu().");
> > +torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees done in an iteration.");
> > +torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num allocations and frees.");
> > +torture_param(int, kfree_no_batch, 0, "Use the non-batching (slower) version of kfree_rcu().");
> > +
> > +static struct task_struct **kfree_reader_tasks;
> > +static int kfree_nrealthreads;
> > +static atomic_t n_kfree_perf_thread_started;
> > +static atomic_t n_kfree_perf_thread_ended;
> > +
> > +struct kfree_obj {
> > + char kfree_obj[8];
> > + struct rcu_head rh;
> > +};
> > +
> > +static int
> > +kfree_perf_thread(void *arg)
> > +{
> > + int i, loop = 0;
> > + long me = (long)arg;
> > + struct kfree_obj *alloc_ptr;
> > + u64 start_time, end_time;
> > +
> > + VERBOSE_PERFOUT_STRING("kfree_perf_thread task started");
> > + set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
> > + set_user_nice(current, MAX_NICE);
> > +
> > + start_time = ktime_get_mono_fast_ns();
> > +
> > + if (atomic_inc_return(&n_kfree_perf_thread_started) >= kfree_nrealthreads) {
> > + if (gp_exp)
> > + b_rcu_gp_test_started = cur_ops->exp_completed() / 2;
>
> At some point, it would be good to use the new grace-period
> sequence-counter functions (rcuperf_seq_diff(), for example) instead of
> the open-coded division by 2. I freely admit that you are just copying
> my obsolete hack in this case, so not needed in this patch.

But I am using rcuperf_seq_diff() below in the pr_alert().

Anyway, I agree this can be a follow-on since this pattern is borrowed from
another part of rcuperf. However, I am also confused about the pattern
itself.

If I understand correctly, you are doing the "/ 2" because expedited_sequence
progresses by 2 for every expedited batch.

But does rcu_seq_diff() really work on these expedited GP numbers, and is it
immune to changes in RCU_SEQ_STATE_MASK? Sorry for the silly questions, but
admittedly I have not yet looked much into expedited RCU, so I could be
missing the point.
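
For reference, going from memory of kernel/rcu/rcu.h (so treat this as a
sketch rather than the exact code), the sequence counters keep the grace-period
count in the upper bits and the state in the low bits:

    #define RCU_SEQ_CTR_SHIFT   2
    #define RCU_SEQ_STATE_MASK  ((1 << RCU_SEQ_CTR_SHIFT) - 1)

    /* Upper bits count grace periods; low bits hold the GP phase/state. */
    static inline unsigned long rcu_seq_ctr(unsigned long s)
    {
        return s >> RCU_SEQ_CTR_SHIFT;
    }

If that is right, rcu_seq_diff() only compares the counter part after masking
off RCU_SEQ_STATE_MASK, which is what made me wonder whether feeding it the
"/ 2" value from exp_completed() mixes two different encodings.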

> > + else
> > + b_rcu_gp_test_finished = cur_ops->get_gp_seq();
> > +
> > + pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld\n",
> > + (unsigned long long)(end_time - start_time), kfree_loops,
> > + rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started));
> > + if (shutdown) {
> > + smp_mb(); /* Assign before wake. */
> > + wake_up(&shutdown_wq);
> > + }
> > + }
> > +
> > + torture_kthread_stopping("kfree_perf_thread");
> > + return 0;
> > +}
> > +
> > +static void
> > +kfree_perf_cleanup(void)
> > +{
> > + int i;
> > +
> > + if (torture_cleanup_begin())
> > + return;
> > +
> > + if (kfree_reader_tasks) {
> > + for (i = 0; i < kfree_nrealthreads; i++)
> > + torture_stop_kthread(kfree_perf_thread,
> > + kfree_reader_tasks[i]);
> > + kfree(kfree_reader_tasks);
> > + }
> > +
> > + torture_cleanup_end();
> > +}
> > +
> > +/*
> > + * shutdown kthread. Just waits to be awakened, then shuts down system.
> > + */
> > +static int
> > +kfree_perf_shutdown(void *arg)
> > +{
> > + do {
> > + wait_event(shutdown_wq,
> > + atomic_read(&n_kfree_perf_thread_ended) >=
> > + kfree_nrealthreads);
> > + } while (atomic_read(&n_kfree_perf_thread_ended) < kfree_nrealthreads);
> > +
> > + smp_mb(); /* Wake before output. */
> > +
> > + kfree_perf_cleanup();
> > + kernel_power_off();
> > + return -EINVAL;
>
> These last four lines should be combined with those of
> rcu_perf_shutdown(). Actually, you could fold the two functions together
> with only a pair of arguments and two one-line wrapper functions, which
> would be even better.

But the cleanup() function is different in the two cases and would have to be
passed in as a function pointer. I believe we discussed this during the last
review as well.
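
If that is what you had in mind, something like the following could work
(completely untested, names made up for illustration):

    /* Hypothetical common helper shared by both shutdown kthreads. */
    static int perf_shutdown_common(atomic_t *n_ended, int nthreads,
                                    void (*cleanup)(void))
    {
            do {
                    wait_event(shutdown_wq,
                               atomic_read(n_ended) >= nthreads);
            } while (atomic_read(n_ended) < nthreads);

            smp_mb(); /* Wake before output. */

            cleanup();
            kernel_power_off();
            return -EINVAL;
    }

    static int kfree_perf_shutdown(void *arg)
    {
            return perf_shutdown_common(&n_kfree_perf_thread_ended,
                                        kfree_nrealthreads,
                                        kfree_perf_cleanup);
    }

with an equivalent one-line wrapper for rcu_perf_shutdown(). Let me know if
you would prefer that over keeping the two functions separate.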

thanks,

- Joel