PostgreSQL pgbench performance regression in 2.6.23+

From: Greg Smith
Date: Wed May 21 2008 - 13:42:43 EST


PostgreSQL ships with a simple database benchmarking tool named pgbench, in what's labeled the contrib section (in many distributions it's a separate package from the main server/client ones). I see there's been some work done already improving how the PostgreSQL server works under the new scheduler (the "Poor PostgreSQL scaling on Linux 2.6.25-rc5" thread). I wanted to provide a different test case using pgbench that has taken a sharp dive starting with 2.6.23, and the server improvement changes in 2.6.25 actually made this problem worse.

I think it will be easy for someone else to replicate my results and I'll go over the exact procedure below. To start with a view of how bad the regression is, here's a summary of the results on one system, an AMD X2 4600+ running at 2.4GHz, with a few interesting kernels. I threw in results from Solaris 10 on this system as a nice independent reference point. The numbers here are transactions/second (TPS) running a simple read-only test over a 160MB data set; I took the median of 3 test runs:

Clients   2.6.9  2.6.22  2.6.24  2.6.25  Solaris
      1   11173   11052   10526   10700     9656
      2   18035   16352   14447   10370    14518
      3   19365   15414   17784    9403    14062
      4   18975   14290   16832    8882    14568
      5   18652   14211   16356    8527    15062
      6   17830   13291   16763    9473    15314
      8   15837   12374   15343    9093    15164
     10   14829   11218   10732    9057    14967
     15   14053   11116    7460    7113    13944
     20   13713   11412    7171    7017    13357
     30   13454   11191    7049    6896    12987
     40   13103   11062    7001    6820    12871
     50   12311   11255    6915    6797    12858

That's the CentOS 4 2.6.9 kernel there, while the rest are stock ones I compiled with a minimum of fiddling from the defaults (just adding support for my SATA RAID card). You can see a major drop with the recent kernels at high client loads, and the changes in 2.6.25 seem to have really hurt even the low client count ones.

The other recent hardware I have here, an Intel Q6600 based system, gives even more maddening results. On successive benchmark runs, you can watch it break down, but only intermittently, once you get just above 8 clients. At 10 and 15 clients, when I run it a few times, I'll sometimes get results in the good 25-30K TPS range, while other runs will give the 10K slow case. It's not a smooth drop-off like in the AMD case; the results from 10-15 clients are really unstable. I've attached some files with 5 quick runs at each client load so you can see what I'm talking about. On that system I was also able to test 2.6.26-rc2, which doesn't look all that different from 2.6.25.

All these results come from running everything on the server using the default local sockets-based interface, which is relevant in the real world because that's how a web app hosted on the same system will talk to the database. If I instead connect to the database over TCP/IP and run the pgbench client on another system, the extra latency drops the single client case to ~3100 TPS. But the high client load cases are great--about 26K TPS at 50 clients. That result is attached as q6600-remote-2.6.25.txt; the remote client was running 2.6.20. Since recent PostgreSQL results were also fine with sysbench as the benchmark driver, this suggests the problem here is actually related to the pgbench client itself and how it gets scheduled relative to the server backends, rather than being inherent to the server.

Replicating the test results
----------------------------

On to replicating my results, which I hope works because I don't have much time to test candidate kernel fixes myself (I was supposed to be working on the PostgreSQL code until this sidetracked me). I'll assume you can get the basic database going; if anybody needs help with that let me know. There is one server tunable that needs to be adjusted before you can get useful PostgreSQL benchmarks from this (and many other) tests. In the root of the database directory, there will be a file named postgresql.conf. Edit that and change the setting for the shared_buffers parameter to 256MB to mimic my test setup. You may need to bump up shmmax (this is the one list where I'm happy I don't have to explain what that means!). Restart the server and check the logs to make sure it came back up; if shmmax is too low it will just tell you how big it needs to be and not start.
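For reference, the two settings involved look something like this; the shmmax figure is just an example I picked comfortably above shared_buffers, not a value from my systems, so size it for your own setup:

```
# postgresql.conf
shared_buffers = 256MB

# kernel.shmmax is in bytes; set via sysctl -w or /etc/sysctl.conf.
# Example: 288MB, comfortably above shared_buffers.
kernel.shmmax = 301989888
```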

Now the basic procedure to run this test is:

  dropdb pgbench                              (if it's already there)
  createdb pgbench
  pgbench -i -s 10 pgbench                    (makes about a 160MB database)
  pgbench -S -c <clients> -t 10000 pgbench

The idea is that you'll have a data set large enough not to fit in L2 cache, but small enough that it all fits in PostgreSQL's dedicated memory (shared_buffers), so the server never has to ask the kernel to read a block. The "pgbench -i" initialization step populates the server's memory; while that data is all written to disk, it should stay in memory afterwards as well. That's why I use this as a general CPU/L2/memory test as viewed from a PostgreSQL context, and as you can see from my results with this problem it's pretty sensitive to whether your setup is optimal or not.

To make this easier to run against a range of client loads, I've attached a script (selecttest.sh) that does the last two steps above. That's what I used to generate all the results I've attached. If you've got the database set up such that you can run the psql client and pgbench is in your path, you should just be able to run that script and have it give you a set of results in a couple of minutes. You can adjust which client loads are tested and how many times each one runs by editing the script.

Addendum: how pgbench works
----------------------------

pgbench works off "command scripts", which are a series of SQL commands with some extra benchmarking features implemented as a really simple programming language. For example, the SELECT-only test run above, what you get when passing -S to pgbench, is implemented like this:

\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
SELECT abalance FROM accounts WHERE aid = :aid;

Here :scale is detected automatically by doing a count on one of the tables in the database; at scale 10 that makes :naccounts one million.

The pgbench client runs as a single process. When pgbench starts, it iterates over each client, parsing the script until it hits a line that needs to be sent to the server. At that point, it issues that command as an asynchronous request, then returns to the main loop. Once every client is primed with a command, it enters a loop where it just waits for responses from them.

The main loop has all the open client connections in a fd_set. Each time a select() on that set says there's been a response to at least one of the clients from the server, it sweeps through all the clients and feeds the next script line to any that are ready for one. This proceeds until the target transaction count is reached.

This design is recognized as being only useful for smallish client loads. The results start dropping off very hard even on a fast machine with >100 simulated clients, as the single pgbench process struggles to respond to every client that is ready on each pass through the loop. This makes pgbench particularly unsuitable for testing on systems with a large number of CPUs; I find pgbench just can't keep up with the useful number of clients somewhere between 8 and 16 cores. I'm hoping the PostgreSQL community can rewrite it in a more efficient way before the next release comes out, now that such hardware is showing up more often running this database. If that's the only way to resolve the issue outlined in this message, that's not intolerable, but a kernel fix would obviously be better.

I wanted to submit this here regardless, because I'd really like for current versions not to have a big regression just because they're running on a newer kernel, and it provides an interesting scheduler test case to add to the mix. The fact that earlier Linux kernels and alternate ones like Solaris give pretty consistent results here says this programming approach isn't impossible for a kernel to support well; I just don't think this specific type of load has been considered in the test cases for the new scheduler yet.

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD

#!/bin/bash

uname -pr

SCALE=10
TOTTRANS=100000

SETTIMES=3
SETCLIENTS="1 2 3 4 5 6 8 10 15 20 30 40 50"

TESTDB="pgbench"

pgbench -i -s $SCALE $TESTDB > /dev/null 2>&1

for C in $SETCLIENTS; do
    T=1
    while [ $T -le "$SETTIMES" ]; do
        TRANS=`expr $TOTTRANS / $C`
        pgbench -S -n -c $C -t $TRANS $TESTDB > results.txt
        TPS=`grep "(including connections establishing)" results.txt | cut -d " " -f 3`
        echo $C $TPS
        T=$(( $T + 1 ))
    done
done

rm -f results.txt

2.6.25 x86_64 (on server; client run on remote host with kernel 2.6.20-16)
1 3057.844199
1 3092.482163
1 3121.953364
2 5727.142659
2 5908.297317
2 5926.888628
3 9363.477540
3 9433.084801
3 9431.190712
4 13004.533641
4 12895.343840
4 12949.625568
5 15874.535293
5 16215.776199
5 15909.425730
6 18579.074963
6 18712.558182
6 18453.177986
8 20867.107616
8 20611.982116
8 20808.939187
10 22629.902429
10 22739.298715
10 22212.577028
15 26653.026061
15 25672.065614
15 26483.221996
20 27557.045841
20 26237.814831
20 28956.575850
30 23166.785331
30 26702.258997
30 28583.974107
40 27541.904319
40 25891.167513
40 26476.592971
50 26434.081991
50 25637.140628
50 26099.091465

2.6.25 x86_64
1 10330.660688
1 11271.754910
1 11282.125571
1 11256.340415
1 11325.051399
2 12504.737733
2 12588.134248
2 12441.831328
2 12447.620413
2 12593.846193
3 12628.665286
3 12766.801694
3 12797.020210
3 12959.703085
3 12905.702894
4 13958.284828
4 13977.373428
4 14109.186195
4 13034.869580
4 13005.338692
5 11994.961157
5 14766.047482
5 14344.018623
5 12404.053099
5 12007.859384
6 10916.289994
6 12145.067460
6 12109.840159
6 9693.585149
6 12180.340072
8 10810.231149
8 10837.233744
8 10799.744867
8 10839.094402
8 10816.589793
10 10655.716568
10 10643.532452
10 10609.845427
10 10615.836344
10 10645.945965
15 10277.499687
15 10207.888097
15 10193.409730
15 10217.082607
15 10207.900603
20 9719.168513
20 9715.113997
20 9718.205094
20 9701.906027
20 9690.018254
30 8899.177367
30 8844.672113
30 8868.549891
30 8879.713057
30 8884.936474
40 8361.219394
40 8350.369479
40 8363.908997
40 8348.133182
40 8344.822067
50 8095.186440
50 8095.049481
50 8131.078184
50 8096.018127
50 8090.840723

2.6.24.4 x86_64
1 11421.820154
1 11431.670391
1 11449.594192
1 11427.799562
1 11468.476484
2 14325.863542
2 14437.174685
2 14402.338248
2 14799.436556
2 14772.314319
3 19668.805474
3 19535.175389
3 19354.685119
3 19295.420668
3 19336.724384
4 22103.545829
4 22602.537542
4 21865.331424
4 21178.368668
4 22424.647019
5 26270.300375
5 26614.721827
5 26678.889155
5 27197.844190
5 25774.059440
6 27238.368411
6 27730.210861
6 27489.568666
6 28347.088836
6 27122.737466
8 27632.278480
8 28796.070834
8 29232.842514
8 28681.952426
8 28562.876030
10 31189.910688
10 30459.861670
10 30330.180410
10 30726.648362
10 10902.279165
15 10447.387234
15 25295.659944
15 10375.324430
15 10355.221697
15 11314.860580
20 9897.298701
20 9892.404276
20 9868.676534
20 8911.663139
20 9879.533903
30 9018.739658
30 9052.746303
30 9018.794160
30 9009.324773
30 9272.859955
40 8501.766072
40 8538.091714
40 8476.846342
40 8664.056995
40 8490.264553
50 8192.826361
50 8218.880626
50 8225.086398
50 8221.221900
50 8343.573679

2.6.22.19 x86_64
1 7623.484051
1 7625.915300
1 7589.468641
1 7570.584916
1 7652.315514
2 17702.824804
2 17369.699463
2 17222.642263
2 17593.340147
2 15637.517344
3 26377.325613
3 19256.513966
3 26813.207675
3 28777.228927
3 29432.081702
4 22640.938711
4 27589.357791
4 21602.130661
4 20272.457778
4 28949.123652
5 25815.538683
5 24871.847804
5 26238.117740
5 25570.425042
5 24551.637987
6 23901.788403
6 25105.699222
6 26229.009396
6 25517.620111
6 21909.853124
8 23674.903797
8 25231.645429
8 25255.745998
8 23869.783647
8 23818.807473
10 21703.371771
10 23839.408211
10 23185.911127
10 23665.093490
10 24717.906888
15 23421.502246
15 23403.340506
15 23329.025587
15 22730.765349
15 23207.747521
20 22480.887312
20 22635.157923
20 22511.885150
20 22223.832215
20 16553.580879
30 19407.089071
30 21718.108980
30 20645.888631
30 21650.537929
30 21993.984923
40 20098.232119
40 19562.630446
40 20236.880784
40 19181.712002
40 20835.781538
50 19043.951727
50 19859.900319
50 18122.228998
50 19467.880528
50 19921.715626

2.6.26-rc2 x86_64
1 11023.139112
1 11039.151787
1 10961.233297
1 11006.943841
1 11034.116274
2 11588.185104
2 11412.046785
2 11636.440818
2 11519.910495
2 10110.350431
3 12812.004251
3 13580.622648
3 13379.527058
3 13303.612765
3 13251.912767
4 13281.604142
4 13800.818582
4 12847.651013
4 12579.769893
4 12669.317510
5 13070.632785
5 12503.529121
5 12653.504407
5 12442.387082
5 11895.378717
6 12256.322309
6 12228.701519
6 12628.954679
6 12203.115311
6 12610.640729
8 11754.685359
8 11421.719702
8 10237.187443
8 11729.049572
8 11575.933726
10 11582.762487
10 11567.416116
10 11612.546009
10 11580.299826
10 11511.407517
15 11301.027547
15 11211.228675
15 11270.000193
15 11164.906529
15 11120.390151
20 11222.653060
20 10847.887130
20 11343.419297
20 11158.649437
20 11307.272182
30 10302.870811
30 10092.840200
30 10404.836485
30 10153.170822
30 10633.193429
40 10368.266304
40 10017.006874
40 9682.031437
40 10166.772689
40 10413.682496
50 9994.610906
50 9333.995176
50 9426.160782
50 9845.708881
50 10018.081636