Significant performance difference for postgres w/o sched_autogroup

From: Andres Freund
Date: Wed Jul 08 2015 - 11:46:01 EST


Hi,

while debugging a performance issue I noticed the following very odd
bit.

Kernel: 4.1.0-rc7-andres-00049-ge64f638
CPU: 2x Xeon E5520

Workload:

postgresql (a multi-process SQL server communicating via shared memory)
and the benchmark client (pgbench, multi-threaded) running on the same
machine, connected via unix sockets. The statements are relatively
simple (~1.5 ms on average), forcing frequent back-and-forth between
server processes and pgbench threads.
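
For reference, a minimal sketch of the invocation; scale, client and
thread counts here are illustrative, not the exact values used:

  $ pgbench -i -s 100 postgres                      # initialize test tables
  $ pgbench -h /tmp -j 8 -c 16 -P 1 -T 10 postgres  # socket dir, 1s progress

-h pointing at a directory makes libpq connect via the unix socket in
it; -P 1 produces the per-second progress lines quoted below.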

I found that disabling sched_autogroup *significantly* reduces
throughput, even when both server and client are started from the same
terminal and thus should be in the same autogroup!
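
(For anyone reproducing: autogroup can be toggled at runtime, no reboot
required:

  $ echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled  # disable
  $ echo 1 | sudo tee /proc/sys/kernel/sched_autogroup_enabled  # enable
)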

There's a significant difference in %sys with autogroup enabled vs.
disabled: ~8% vs. ~27%. That seems like far too much.
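
(The %sys numbers are from watching something like the following while
the benchmark runs; any of mpstat/vmstat/top shows the same column:

  $ mpstat 1        # one line per second, see the %sys column
)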

The difference in postgres performance is significant:
enabled:
progress: 1.0 s, 25857.2 tps, lat 1.171 ms stddev 3.314
progress: 2.0 s, 19582.5 tps, lat 1.586 ms stddev 5.349
progress: 3.0 s, 17796.3 tps, lat 1.804 ms stddev 7.364
progress: 4.0 s, 18736.7 tps, lat 1.696 ms stddev 7.612
progress: 5.0 s, 19269.9 tps, lat 1.640 ms stddev 6.835
progress: 6.0 s, 21389.6 tps, lat 1.467 ms stddev 5.663
progress: 7.0 s, 22153.8 tps, lat 1.417 ms stddev 8.417
progress: 8.0 s, 21550.7 tps, lat 1.487 ms stddev 7.460
progress: 9.0 s, 22602.3 tps, lat 1.401 ms stddev 6.738
progress: 10.0 s, 23475.5 tps, lat 1.327 ms stddev 7.111
tps = 21329.219141 (excluding connections establishing)

disabled:
progress: 1.0 s, 25092.5 tps, lat 1.255 ms stddev 0.511
progress: 2.0 s, 13322.0 tps, lat 2.377 ms stddev 5.435
progress: 3.0 s, 13903.0 tps, lat 2.298 ms stddev 5.075
progress: 4.0 s, 13734.0 tps, lat 2.328 ms stddev 5.146
progress: 5.0 s, 14015.0 tps, lat 2.281 ms stddev 5.099
progress: 6.0 s, 14034.0 tps, lat 2.277 ms stddev 5.002
progress: 7.0 s, 13930.0 tps, lat 2.297 ms stddev 5.136
progress: 8.0 s, 14015.0 tps, lat 2.279 ms stddev 5.161
progress: 9.0 s, 14000.0 tps, lat 2.287 ms stddev 5.117
progress: 10.0 s, 13900.6 tps, lat 2.299 ms stddev 5.142
tps = 15006.317841 (excluding connections establishing)

(Note the odd dip after the first second; that's what I was actually debugging.)

A perf diff of profiles recorded with -g -a -e cycles:k shows:
# Baseline Delta Shared Object Symbol
# ........ ....... ................... ..............................................
#
11.70% +3.87% [kernel.vmlinux] [k] unix_stream_recvmsg
6.74% +2.49% [kernel.vmlinux] [k] __fget
6.69% +4.31% [kernel.vmlinux] [k] sys_recvfrom
6.49% +1.77% [kernel.vmlinux] [k] system_call_after_swapgs
5.37% +1.39% [kernel.vmlinux] [k] _raw_spin_lock
4.63% +1.76% [kernel.vmlinux] [k] mutex_lock_interruptible
4.27% +1.27% [kernel.vmlinux] [k] mutex_unlock
3.43% +1.22% [kernel.vmlinux] [k] fput
3.41% -0.21% [kernel.vmlinux] [k] __fget_light
2.96% +1.33% libpthread-2.19.so [.] __libc_recv
2.56% +2.35% [kernel.vmlinux] [k] system_call_fastpath
2.36% +0.80% [kernel.vmlinux] [k] import_single_range
2.30% +0.82% [kernel.vmlinux] [k] sockfd_lookup_light
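
(Roughly how the two profiles were gathered and compared; the run
length and file names below are guesses, the event options are as
stated above:

  $ perf record -g -a -e cycles:k -o perf.enabled -- sleep 10   # autogroup on
  $ perf record -g -a -e cycles:k -o perf.disabled -- sleep 10  # autogroup off
  $ perf diff perf.enabled perf.disabled
)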

Looking a bit further into the profile shows an apparent difference in
how often recv goes through the syscall fastpath:

enabled:
- 6.55% pgbench [kernel.vmlinux] [k] sys_recvfrom
- sys_recvfrom
+ 85.88% system_call_fastpath
+ 14.12% __libc_recv

disabled:
- 10.97% pgbench [kernel.vmlinux] [k] sys_recvfrom
- sys_recvfrom
+ 74.16% system_call_fastpath
+ 25.84% __libc_recv

Is that actually expected, or is there a bug somewhere? The difference
seems a bit large to me.

If it would be useful, I can try to script the workload in a
reproducible manner.

Regards,

Andres