Re: [RFC 0/7] pstore: Improve performance of ftrace backend with ramoops

From: Steven Rostedt
Date: Tue Oct 11 2016 - 05:58:34 EST


On Fri, 7 Oct 2016 22:28:27 -0700
Joel Fernandes <joelaf@xxxxxxxxxx> wrote:

> Here's an early RFC for a patch series on improving ftrace throughput with
> ramoops. I am hoping to get some early comments so I'm releasing it in advance.
> It is functional and tested.
>
> Currently ramoops uses a single zone to store function traces. To make this
> work, it has to use locking to synchronize accesses to the buffers. Recently
> the synchronization was completely moved from a cmpxchg mechanism to raw
> spinlocks due to difficulties in using cmpxchg on uncached memory and on RAM
> behind PCIe. [1] This change further dropped the performance of the ramoops
> pstore backend by more than half in my tests.
>
> This patch series improves throughput dramatically, by around 280% over what
> it is now, by creating a ramoops persistent zone for each CPU and avoiding
> locking altogether on the ftrace write path. At init time, the per-CPU
> persistent zones are then merged together.
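>
> A minimal sketch of the per-CPU write path (hypothetical names, not the
> actual patch; it assumes one persistent_ram_zone per possible CPU and that
> the ftrace callback runs with preemption disabled, so smp_processor_id()
> is stable):
>
>     #include <linux/smp.h>
>     #include <linux/pstore_ram.h>
>
>     /* one zone per CPU, so no lock is needed on the write path */
>     static struct persistent_ram_zone **ftrace_przs;
>
>     static void ftrace_write_this_cpu(const void *rec, size_t size)
>     {
>         /* only this CPU ever writes to its private zone */
>         struct persistent_ram_zone *prz = ftrace_przs[smp_processor_id()];
>
>         persistent_ram_write(prz, rec, size);
>     }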
>
> Here are some tests to show the improvements. Tested using a quad-core
> x86_64 qemu instance with -mem-path to persist the guest RAM to a file. I
> measured the average throughput of dd over 30 seconds:
>
> dd if=/dev/zero | pv | dd of=/dev/null
>
> Without this patch series: 24 MB/s
> With per-CPU buffers and counter increment: 91.5 MB/s (improvement of ~281%)
> With per-CPU buffers and trace_clock: 51.9 MB/s
>
> Some more considerations:
> 1. In order to merge the individual buffers, I am using racy counters since
> I didn't want to sacrifice throughput for perfect timestamps. Using
> trace_clock() for the timestamps did the job, but gave almost half the
> throughput of the counter-based timestamps (a rough sketch of the counter
> approach follows this list).
>
> 2. Since the patches divide the available ftrace persistent space by the
> number of CPUs, less space will now be available per CPU; however, the user
> is free to disable the per-CPU behavior and revert to the old behavior via
> the PSTORE_PER_CPU flag. It's a space vs. performance trade-off, so if the
> user has enough space and not a lot of CPUs, using per-CPU persistent
> buffers makes sense for better performance.
>
> 3. Without using any counters or timestamps the improvement is even larger
> (~140 MB/s), but then the buffers cannot be merged.
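>
> To make point 1 concrete, here is a rough sketch of the racy-counter
> approach (all names below are hypothetical and simplified, not the actual
> patch): a shared counter is incremented without any synchronization, so
> occasional duplicate or out-of-order values are possible, but it is cheap
> and good enough to interleave the per-CPU zones when they are merged at
> init time.
>
>     #include <linux/types.h>
>
>     /* shared sequence counter, deliberately updated without locking */
>     static u64 ftrace_stamp;
>
>     struct ftrace_prz_record {
>         unsigned long ip;
>         unsigned long parent_ip;
>         u64 ts;                 /* counter value, not a real clock */
>     };
>
>     static void ftrace_record_one(unsigned long ip, unsigned long parent_ip)
>     {
>         struct ftrace_prz_record rec = {
>             .ip        = ip,
>             .parent_ip = parent_ip,
>             .ts        = ftrace_stamp++,    /* racy on purpose */
>         };
>
>         /* write rec into this CPU's private zone (see earlier sketch) */
>         ftrace_write_this_cpu(&rec, sizeof(rec));
>     }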
>
> [1] https://lkml.org/lkml/2016/9/8/375

From a tracing point of view, I have no qualms with this patch set.

-- Steve

>
> Joel Fernandes (7):
> pstore: Make spinlock per zone instead of global
> pstore: locking: dont lock unless caller asks to
> pstore: Remove case of PSTORE_TYPE_PMSG write using deprecated
> function
> pstore: Make ramoops_init_przs generic for other prz arrays
> ramoops: Split ftrace buffer space into per-CPU zones
> pstore: Add support to store timestamp counter in ftrace records
> pstore: Merge per-CPU ftrace zones into one zone for output
>
> fs/pstore/ftrace.c | 3 +
> fs/pstore/inode.c | 7 +-
> fs/pstore/internal.h | 34 -------
> fs/pstore/ram.c | 234 +++++++++++++++++++++++++++++++++++----------
> fs/pstore/ram_core.c | 30 +++---
> include/linux/pstore.h | 69 +++++++++++++
> include/linux/pstore_ram.h | 6 +-
> 7 files changed, 280 insertions(+), 103 deletions(-)
>