ram0: 981MiB/s
non-bitmap: 132MiB/s
internal-bitmap: 95.5MiB/s
So I waited for Paul to get a chance to test it on real disks;
still, the results are similar to the above.
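To put a number on the bitmap cost, here is a quick sketch. It assumes the non-bitmap and internal-bitmap figures above come from otherwise identical runs; the helper name slowdown_pct is made up for illustration.

```shell
# Hypothetical helper: percentage of throughput lost relative to a baseline.
# $1 = baseline in MiB/s, $2 = measured in MiB/s
slowdown_pct() {
    LC_ALL=C awk -v base="$1" -v cur="$2" \
        'BEGIN { printf "%.1f\n", (base - cur) / base * 100 }'
}

# non-bitmap (132MiB/s) vs internal-bitmap (95.5MiB/s)
slowdown_pct 132 95.5
```

For the figures above this comes out at a bit under 28%.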
You can see examples here:
https://github.com/brendangregg/FlameGraph
In short, while the test is running:
perf record -a -g -- sleep 10
perf script -i perf.data | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
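If it helps, the two steps can be wrapped up roughly like this. It is only a sketch: the function name make_flamegraph and its arguments are my own invention, and it assumes you have a checkout of the FlameGraph repository and enough privileges for system-wide sampling.

```shell
# Hypothetical wrapper around the two commands above.
# $1 = path to a FlameGraph checkout (assumed layout: scripts at top level)
# $2 = output SVG file (default perf.svg)
# $3 = seconds to sample (default 10)
make_flamegraph() {
    fg_dir=$1
    out_svg=${2:-perf.svg}
    secs=${3:-10}

    # Fail early if the FlameGraph scripts are not where we expect them.
    for script in stackcollapse-perf.pl flamegraph.pl; do
        if [ ! -f "$fg_dir/$script" ]; then
            echo "missing $fg_dir/$script" >&2
            return 1
        fi
    done

    # System-wide sampling with call graphs; this usually needs root.
    perf record -a -g -o perf.data -- sleep "$secs" || return 1

    # Fold the sampled stacks and render them as an SVG.
    perf script -i perf.data \
        | "$fg_dir/stackcollapse-perf.pl" \
        | "$fg_dir/flamegraph.pl" > "$out_svg"
}
```

Call it while the test is in progress, for example: make_flamegraph ~/FlameGraph raid.svg 10, then open the SVG in a browser.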
BTW, you said that you're using a production environment; this will
probably make it hard to analyze performance.