On Sat, Nov 23, 2002 at 09:29:22AM +1100, Con Kolivas wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Here is a partial run of contest (http://contest.kolivas.net) benchmarks for 
> rc2aa1 with the disk latency hack
> 
> noload:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [5]              71.7    93      0       0       0.98
> 2.4.19 [5]              69.0    97      0       0       0.94
> 2.4.20-rc1 [3]          72.2    93      0       0       0.99
> 2.4.20-rc1aa1 [1]       71.9    94      0       0       0.98
> 2420rc2aa1 [1]          71.1    94      0       0       0.97
> 
> cacherun:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [2]              66.6    99      0       0       0.91
> 2.4.19 [2]              68.0    99      0       0       0.93
> 2.4.20-rc1 [3]          67.2    99      0       0       0.92
> 2.4.20-rc1aa1 [1]       67.4    99      0       0       0.92
> 2420rc2aa1 [1]          66.6    99      0       0       0.91
> 
> process_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              109.5   57      119     44      1.50
> 2.4.19 [3]              106.5   59      112     43      1.45
> 2.4.20-rc1 [3]          110.7   58      119     43      1.51
> 2.4.20-rc1aa1 [3]       110.5   58      117     43      1.51*
> 2420rc2aa1 [1]          212.5   31      412     69      2.90*
> 
> This load just copies data between 4 processes repeatedly. Seems to take 
> longer.
Can you go into include/linux/blkdev.h and increase MAX_QUEUE_SECTORS to (2
<< (20 - 9)) and see if it makes any difference here (sketched below)? If it
doesn't make a difference, it could be the slightly increased readahead, but I
doubt it's the latter.
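A hedged sketch of that one-line change, assuming the current definition in
the rc2aa1 tree is (1 << (20 - 9)), i.e. 1 MB expressed in 512-byte sectors
(the exact current value may differ):

    /* include/linux/blkdev.h -- suggested experiment:
     * (N << (20 - 9)) is N megabytes expressed in 512-byte sectors.
     */
    #define MAX_QUEUE_SECTORS (2 << (20 - 9))   /* 2 << 11 = 4096 sectors = 2 MB */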
> ctar_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              117.4   63      1       7       1.60
> 2.4.19 [2]              106.5   70      1       8       1.45
> 2.4.20-rc1 [3]          102.1   72      1       7       1.39
> 2.4.20-rc1aa1 [3]       107.1   69      1       7       1.46
> 2420rc2aa1 [1]          103.3   73      1       8       1.41
> 
> xtar_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              150.8   49      2       8       2.06
> 2.4.19 [1]              132.4   55      2       9       1.81
> 2.4.20-rc1 [3]          180.7   40      3       8       2.47
> 2.4.20-rc1aa1 [3]       166.6   44      2       7       2.28*
> 2420rc2aa1 [1]          217.7   34      4       9       2.97*
> 
> Takes longer. Is only one run though so may not be an accurate average.
This is most probably a too-small waitqueue. Of course, increasing the
waitqueue will also increase latency a bit for the other workloads; it's a
tradeoff and there's no way around it. Even read-latency has the same tradeoff
when it chooses the "nth" place, the seventh slot, as the spot to put a read
request when insertion fails (sketched below).
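A hedged illustration of that idea, not the actual read-latency patch code;
the struct and function names are made up for this sketch, and only the
"insert a read a fixed number of slots from the head" behaviour is shown:

    #define READ_INSERT_SLOT 7   /* the "nth" place mentioned above */

    struct request_stub {
            struct request_stub *next;
    };

    /* A read that failed to merge anywhere gets bounded latency by being
     * queued at most READ_INSERT_SLOT entries from the head, instead of
     * going to the tail behind all pending writes.
     */
    static void insert_failed_read(struct request_stub **head,
                                   struct request_stub *rq)
    {
            struct request_stub **p = head;
            int slot = 0;

            while (*p && slot < READ_INSERT_SLOT) {
                    p = &(*p)->next;
                    slot++;
            }
            rq->next = *p;          /* later requests now wait behind the read */
            *p = rq;
    }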
> 
> 
> io_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              474.1   15      36      10      6.48
> 2.4.19 [3]              492.6   14      38      10      6.73
> 2.4.20-rc1 [2]          1142.2  6       90      10      15.60
> 2.4.20-rc1aa1 [1]       1132.5  6       90      10      15.47
> 2420rc2aa1 [1]          164.3   44      10      9       2.24
> 
> This was where the effect of the disk latency hack was expected to have an 
> effect. It sure did.
yes, I can certainly feel the machine being much more responsive during the
write load too. Too bad some benchmarks like dbench decreased significantly,
but I don't see too many ways around it. At least with these changes the
contiguous write case is unaffected; my storage test box still reads and
writes at over 100 MB/sec, for example. This clearly means that what matters
is having 512k DMA commands, not a huge queue size. Still, with a loaded
machine and potential scheduling delays it could matter more to have a larger
queue; that may be why performance decreased for some workloads here too, not
only because of a less effective elevator. So 2 MB of queue is probably a much
better idea, since then we at least have a ring with 4 elements to refill
after a completion wakeup; I wanted to be strict in the first place so the
"lowlatency" effect would show at its maximum. We could also consider using /4
instead of my current /2 for the batch_sectors initialization (the arithmetic
is sketched below).
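For reference, a back-of-the-envelope of that sizing as a userspace sketch
(not kernel code; DMA_CMD_SECTORS is just a name I'm using here for the 512k
command size):

    #include <stdio.h>

    #define SECTOR_SHIFT        9
    #define MAX_QUEUE_SECTORS   (2 << (20 - SECTOR_SHIFT))    /* 2 MB queue  */
    #define DMA_CMD_SECTORS     (512 << (10 - SECTOR_SHIFT))  /* 512 KB cmd  */

    int main(void)
    {
            printf("queue: %d sectors (%d KB)\n",
                   MAX_QUEUE_SECTORS, MAX_QUEUE_SECTORS / 2);
            printf("512 KB commands per queue: %d\n",        /* the 4-element ring */
                   MAX_QUEUE_SECTORS / DMA_CMD_SECTORS);
            printf("batch_sectors with /2: %d, with /4: %d\n",
                   MAX_QUEUE_SECTORS / 2, MAX_QUEUE_SECTORS / 4);
            return 0;
    }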
BTW, at first glance it looks like 2.5 has the same problem in its queue
sizing too.
> read_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              102.3   70      6       3       1.40
> 2.4.19 [2]              134.1   54      14      5       1.83
> 2.4.20-rc1 [3]          173.2   43      20      5       2.37
> 2.4.20-rc1aa1 [3]       150.6   51      16      5       2.06
> 2420rc2aa1 [1]          140.5   51      13      4       1.92
> 
> list_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              90.2    76      1       17      1.23
> 2.4.19 [1]              89.8    77      1       20      1.23
> 2.4.20-rc1 [3]          88.8    77      0       12      1.21
> 2.4.20-rc1aa1 [1]       88.1    78      1       16      1.20
> 2420rc2aa1 [1]          99.7    69      1       19      1.36
> 
> mem_load:
> Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
> 2.4.18 [3]              103.3   70      32      3       1.41
> 2.4.19 [3]              100.0   72      33      3       1.37
> 2.4.20-rc1 [3]          105.9   69      32      2       1.45
> 
> Mem load hung the machine. I could not get rc2aa1 through this part of the 
> benchmark no matter how many times I tried to run it. No idea what was going 
> on. Easy to reproduce. Simply run the mem_load out of contest (which runs 
> until it is killed) and the machine will hang. 
sorry, but what is mem_load supposed to do other than loop forever? It has
been running for two days on my test box (512M of RAM, 2G of swap, 4-way SMP)
and nothing has happened yet. It's an infinite loop; it sounds like you're
trapping a signal. Wouldn't it be simpler to just finish after a number of
passes? (A minimal sketch of that follows further below.) The machine is
perfectly usable and responsive during mem_load; xmms doesn't skip a beat, for
instance. This is probably thanks to elevator-lowlatency too: I recall xmms
didn't use to be completely smooth during heavy swapping in previous kernels
(because the read() of the sound file didn't return in reasonable time, since
I'm swapping on the same hd where I store the data).
jupiter:~ # uptime
  4:20pm  up 1 day, 14:43,  3 users,  load average: 1.38, 1.28, 1.21
jupiter:~ # vmstat 1
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  1  0 197408   4504    112   1436  21  34    23    34   36    19   0   2  97
 0  1  0 199984   4768    116   1116 11712 5796 11720  5804  514   851   1   2  97
 0  1  0 234684   4280    108   1116 14344 12356 14344 12360  617  1034   0   3  96
 0  1  0 267880   4312    108   1116 10464 11916 10464 11916  539   790   0   3  97
 1  0  0 268704   5192    108   1116 6220 9336  6220  9336  363   474   0   1  99
 0  1  0 270764   5312    108   1116 13036 18952 13036 18952  584   958   0   1  99
 0  1  0 271368   5088    108   1116 8288 5160  8288  5160  386   576   0   1  99
 0  1  1 269184   4296    108   1116 4352 6420  4352  6416  254   314   0   0 100
 0  1  0 266528   4604    108   1116 9644 4652  9644  4656  428   658   0   1  99
there is no way I can reproduce any stability problem with mem_load here
(tested both on scsi quad xeon and ide dualathlon). Can you provide more
details of your problem and/or a SYSRQ+T during the hang? thanks.
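A minimal sketch of the "finish after a number of passes" suggestion; the real
mem_load from contest isn't reproduced here, so the allocation size and access
pattern below are placeholders:

    #include <stdlib.h>
    #include <string.h>

    #define PASSES     100                     /* stop on its own instead of looping forever */
    #define CHUNK_SIZE (256 * 1024 * 1024)     /* placeholder: pick something bigger than RAM */

    int main(void)
    {
            char *buf = malloc(CHUNK_SIZE);
            int pass;

            if (!buf)
                    return 1;
            for (pass = 0; pass < PASSES; pass++)
                    memset(buf, pass, CHUNK_SIZE);   /* touch every page, forcing swap */
            free(buf);
            return 0;
    }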
Andrea