Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()

From: Chris Mason
Date: Thu Sep 17 2015 - 08:14:24 EST


On Thu, Sep 17, 2015 at 02:30:08PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote:
> > On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> > > [cc Tejun]
> > >
> > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > > # ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
> > > # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> > > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > > # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> > > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> > > # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> > > # App overhead is time in microseconds spent in the test not doing file writing related system calls.
> > >
> > > FSUse% Count Size Files/sec App Overhead
> > > 0 80000 4096 106938.0 543310
> > > 0 160000 4096 102922.7 476362
> > > 0 240000 4096 107182.9 538206
> > > 0 320000 4096 107871.7 619821
> > > 0 400000 4096 99255.6 622021
> > > 0 480000 4096 103217.8 609943
> > > 0 560000 4096 96544.2 640988
> > > 0 640000 4096 100347.3 676237
> > > 0 720000 4096 87534.8 483495
> > > 0 800000 4096 72577.5 2556920
> > > 0 880000 4096 97569.0 646996
> > >
> > > <RAM fills here, sustained performance is now dependent on writeback>
> >
> > I think too many variables have changed here.
> >
> > My numbers:
> >
> > FSUse% Count Size Files/sec App Overhead
> > 0 160000 4096 356407.1 1458461
> > 0 320000 4096 368755.1 1030047
> > 0 480000 4096 358736.8 992123
> > 0 640000 4096 361912.5 1009566
> > 0 800000 4096 342851.4 1004152
>
> <snip>
>
> > I can push the dirty threshold lower to try and make sure we end up in
> > the hard dirty limits, but none of this is going to be related to the
> > plugging patch.
>
> The point of this test is to drive writeback as hard as possible,
> not to measure how fast we can create files in memory. i.e. if the
> test isn't pushing the dirty limits on your machines, then it really
> isn't putting a meaningful load on writeback, and so the plugging
> won't make a significant difference because writeback isn't IO
> bound....

It does end up IO bound on my rig, just because we do eventually hit the
dirty limits. Otherwise there would be zero benefit in fs_mark from
any of these patches vs plain v4.2.

But I set up a run last night with vm.dirty_bytes at 3G and
vm.dirty_background_bytes at 1.5G.
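
Roughly what that setup amounts to is below - just a sketch, not the
actual script from this run; it assumes the stock /proc/sys/vm knobs,
and remember that writing the *_bytes files zeroes the matching
*_ratio ones:

#include <stdio.h>
#include <stdlib.h>

/* write one value into a /proc/sys/vm file, bail loudly on failure */
static void set_sysctl(const char *path, unsigned long long val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%llu\n", val);
	fclose(f);
}

int main(void)
{
	set_sysctl("/proc/sys/vm/dirty_bytes", 3ULL << 30);		/* 3G   */
	set_sysctl("/proc/sys/vm/dirty_background_bytes", 3ULL << 29);	/* 1.5G */
	return 0;
}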

There is definitely variation, but nothing like what you saw:

FSUse% Count Size Files/sec App Overhead
0 160000 4096 317427.9 1524951
0 320000 4096 319723.9 1023874
0 480000 4096 336696.4 1053884
0 640000 4096 257113.1 1190851
0 800000 4096 257644.2 1198054
0 960000 4096 254896.6 1225610
0 1120000 4096 241052.6 1203227
0 1280000 4096 214961.2 1386236
0 1440000 4096 239985.7 1264659
0 1600000 4096 232174.3 1310018
0 1760000 4096 250477.9 1227289
0 1920000 4096 221500.9 1276223
0 2080000 4096 235212.1 1284989
0 2240000 4096 238580.2 1257260
0 2400000 4096 224182.6 1326821
0 2560000 4096 234628.7 1236402
0 2720000 4096 244675.3 1228400
0 2880000 4096 234364.0 1268408
0 3040000 4096 229712.6 1306148
0 3200000 4096 241170.5 1254490
0 3360000 4096 220487.8 1331456
0 3520000 4096 215831.7 1313682
0 3680000 4096 210934.7 1235750
0 3840000 4096 218435.4 1258077
0 4000000 4096 232127.7 1271555
0 4160000 4096 212017.6 1381525
0 4320000 4096 216309.3 1370558
0 4480000 4096 239072.4 1269086
0 4640000 4096 221959.1 1333164
0 4800000 4096 228396.8 1213160
0 4960000 4096 225747.5 1318503
0 5120000 4096 115727.0 1237327
0 5280000 4096 184171.4 1547357
0 5440000 4096 209917.8 1380510
0 5600000 4096 181074.7 1391764
0 5760000 4096 263516.7 1155172
0 5920000 4096 236405.8 1239719
0 6080000 4096 231587.2 1221408
0 6240000 4096 237118.8 1244272
0 6400000 4096 236773.2 1201428
0 6560000 4096 243987.5 1240527
0 6720000 4096 232428.0 1283265
0 6880000 4096 234839.9 1209152
0 7040000 4096 234947.3 1223456
0 7200000 4096 231463.1 1260628
0 7360000 4096 226750.3 1290098
0 7520000 4096 213632.0 1236409
0 7680000 4096 194710.2 1411595
0 7840000 4096 213963.1 4146893
0 8000000 4096 225109.8 1323573
0 8160000 4096 251322.1 1380271
0 8320000 4096 220167.2 1159390
0 8480000 4096 210991.2 1110593
0 8640000 4096 197922.8 1126072
0 8800000 4096 203539.3 1143501
0 8960000 4096 193041.7 1134329
0 9120000 4096 184667.9 1119222
0 9280000 4096 165968.7 1172738
0 9440000 4096 192767.3 1098361
0 9600000 4096 227115.7 1158097
0 9760000 4096 232139.8 1264245
0 9920000 4096 213320.5 1270505
0 10080000 4096 217013.4 1324569
0 10240000 4096 227171.6 1308668
0 10400000 4096 208591.4 1392098
0 10560000 4096 212006.0 1359188
0 10720000 4096 213449.3 1352084
0 10880000 4096 219890.1 1326240
0 11040000 4096 215907.7 1239180
0 11200000 4096 214207.2 1334846
0 11360000 4096 212875.2 1338429
0 11520000 4096 211690.0 1249519
0 11680000 4096 217013.0 1262050
0 11840000 4096 204730.1 1205087
0 12000000 4096 191146.9 1188635
0 12160000 4096 207844.6 1157033
0 12320000 4096 208857.7 1168111
0 12480000 4096 198256.4 1388368
0 12640000 4096 214996.1 1305412
0 12800000 4096 212332.9 1357814
0 12960000 4096 210325.8 1336127
0 13120000 4096 200292.1 1282419
0 13280000 4096 202030.2 1412105
0 13440000 4096 216553.7 1424076
0 13600000 4096 218721.7 1298149
0 13760000 4096 202037.4 1266877
0 13920000 4096 224032.3 1198159
0 14080000 4096 206105.6 1336489
0 14240000 4096 227540.3 1160841
0 14400000 4096 236921.7 1190394
0 14560000 4096 229343.3 1147451
0 14720000 4096 199435.1 1284374
0 14880000 4096 215177.3 1178542
0 15040000 4096 206194.1 1170832
0 15200000 4096 215762.3 1125633
0 15360000 4096 194511.0 1122947
0 15520000 4096 179008.5 1292603
0 15680000 4096 208636.9 1094960
0 15840000 4096 192173.1 1237891
0 16000000 4096 212888.9 1111551
0 16160000 4096 218403.0 1143400
0 16320000 4096 207260.5 1233526
0 16480000 4096 202123.2 1151509
0 16640000 4096 191033.0 1257706
0 16800000 4096 196865.4 1154520
0 16960000 4096 210361.2 1128930
0 17120000 4096 201755.2 1160469
0 17280000 4096 196946.6 1173529
0 17440000 4096 199677.8 1165750
0 17600000 4096 194248.4 1234944
0 17760000 4096 200027.9 1256599
0 17920000 4096 206507.0 1166820
0 18080000 4096 215082.7 1167599
0 18240000 4096 201475.5 1212202
0 18400000 4096 208247.6 1252255
0 18560000 4096 205482.7 1311436
0 18720000 4096 200111.9 1358784
0 18880000 4096 200028.3 1351332
0 19040000 4096 198873.4 1287400
0 19200000 4096 209609.3 1268400
0 19360000 4096 203538.6 1249787
0 19520000 4096 203427.9 1294105
0 19680000 4096 201905.3 1280714
0 19840000 4096 209642.9 1283281
0 20000000 4096 203438.9 1315427
0 20160000 4096 199690.7 1252267
0 20320000 4096 185965.2 1398905
0 20480000 4096 203221.6 1214029
0 20640000 4096 208654.8 1232679
0 20800000 4096 212488.6 1298458
0 20960000 4096 189701.1 1356640
0 21120000 4096 198522.1 1361240
0 21280000 4096 203857.3 1263402
0 21440000 4096 204616.8 1362853
0 21600000 4096 196310.6 1266710
0 21760000 4096 203275.4 1391150
0 21920000 4096 205998.5 1378741
0 22080000 4096 205434.2 1283787
0 22240000 4096 195918.0 1415912
0 22400000 4096 186193.0 1413623
0 22560000 4096 192911.3 1393471
0 22720000 4096 203726.3 1264281
0 22880000 4096 204853.4 1221048
0 23040000 4096 222803.2 1153031
0 23200000 4096 198558.6 1346256
0 23360000 4096 201001.4 1278817
0 23520000 4096 206225.2 1270440
0 23680000 4096 190894.2 1425299
0 23840000 4096 198555.6 1334122
0 24000000 4096 202386.4 1332157
0 24160000 4096 205103.1 1313607

>
> > I do see lower numbers if I let the test run even
> > longer, but there are a lot of things in the way that can slow it down
> > as the filesystem gets that big.
>
> Sure, that's why I hit the dirty limits early in the test - so it
> measures steady state performance before the fs gets to any
> significant scalability limits....
>
> > > The baseline of no plugging is a full 3 minutes faster than the
> > > plugging behaviour of Linus' patch. The IO behaviour demonstrates
> > > that, sustaining between 25-30,000 IOPS and throughput of
> > > 130-150MB/s. Hence, while Linus' patch does change the IO patterns,
> > > it does not result in a performance improvement like the original
> > > plugging patch did.
> >
> > How consistent is this across runs?
>
> That's what I'm trying to work out. I didn't report it until I got
> consistently bad results - the numbers I reported were from the
> third time I ran the comparison, and they were representative and
> reproducible. I also ran my inode creation workload that is similar
> (but has no data writeback so doesn't go through writeback paths at
> all) and that shows no change in performance, so this problem
> (whatever it is) is only manifesting itself through data
> writeback....

The big change between Linus' patch and your patch is that with Linus'
version, kblockd is probably doing most of the actual unplug work
(except for the last superblock in the list). If a process is waiting
for dirty writeout progress, it has to wait for that context switch to
kblockd.

In the VM, that's going to hurt more than it does on my big two-socket,
mostly idle machine.
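
To make that concrete, here's a rough sketch of the shape of the code
we're arguing about. This is not the actual patch and the wrapper name
is made up; wb->list_lock, __writeback_inodes_wb() and the plug calls
are the real names from fs/fs-writeback.c circa v4.2, with the body
simplified:

static long writeback_inodes_plugged(struct bdi_writeback *wb,
				     struct wb_writeback_work *work)
{
	struct blk_plug plug;
	long wrote;

	blk_start_plug(&plug);

	spin_lock(&wb->list_lock);
	wrote = __writeback_inodes_wb(wb, work);	/* batches up the IO */
	spin_unlock(&wb->list_lock);

	/*
	 * Unplugging here, after the lock is dropped, submits the batched
	 * bios from our own context.  If we instead go to sleep with the
	 * plug still open (say, blocking on list_lock elsewhere), the
	 * scheduler flushes the plug for us and hands the dispatch to
	 * kblockd - that handoff is the extra context switch a task
	 * waiting on dirty writeout ends up paying for.
	 */
	blk_finish_plug(&plug);

	return wrote;
}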

>
> The only measurable change I've noticed in my monitoring graphs is
> that there is a lot more iowait time than I normally see, even when
> the plugging appears to be working as desired. That's what I'm
> trying to track down now, and once I've got to the bottom of that I
> should have some idea of where the performance has gone....
>
> As it is, there are a bunch of other things going wrong with
> 4.3-rc1+ right now that I'm working through - I haven't updated my
> kernel tree for 10 days because I've been away on holidays so I'm
> doing my usual "-rc1 is broken again" dance that I do every release
> cycle. (e.g. every second boot hangs because systemd appears to be
> waiting for iscsi devices to appear without first starting the iscsi
> target daemon. That never happened before today; every new kernel I've
> booted today has hung on the first cold boot of the VM).

I've been doing 4.2 plus patches because rc1 didn't boot on this strange
box. Let me nail that down and rerun.

-chris