Where is the performance bottleneck?

From: Holger Kiehl
Date: Mon Aug 29 2005 - 13:22:35 EST


Hello

I have a system with the following setup:

Board is Tyan S4882 with AMD 8131 Chipset
4 Opterons 848 (2.2GHz)
8 GB DDR400 Ram (2GB for each CPU)
1 onboard Symbios Logic 53c1030 dual channel U320 controller
2 SATA disks put together as a SW Raid1 for system, swap and spares
8 SCSI U320 (15000 rpm) disks where 4 disks (sdc, sdd, sde, sdf)
are on one channel and the other four (sdg, sdh, sdi, sdj) on
the other channel.

The U320 SCSI controller has a 64-bit PCI-X bus to itself; there is no other
device on that bus. Unfortunately, I was unable to determine at what speed
it is running. Here is the output from lspci -vv:

02:04.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-
Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Step
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort
Latency: 72 (4250ns min, 4500ns max), Cache Line Size 10
Interrupt: pin A routed to IRQ 217
Region 0: I/O ports at 3000 [size=256]
Region 1: Memory at fe010000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at fe000000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable
Address: 0000000000000000 Data: 0000
Capabilities: [68] PCI-X non-bridge device.
Command: DPERE- ERO- RBC=2 OST=0
Status: Bus=2 Dev=4 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,

02:04.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-
Subsystem: LSI Logic / Symbios Logic: Unknown device 1000
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Step
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort
Latency: 72 (4250ns min, 4500ns max), Cache Line Size 10
Interrupt: pin B routed to IRQ 225
Region 0: I/O ports at 3400 [size=256]
Region 1: Memory at fe030000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at fe020000 (64-bit, non-prefetchable) [size=64K]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable
Address: 0000000000000000 Data: 0000
Capabilities: [68] PCI-X non-bridge device.
Command: DPERE- ERO- RBC=2 OST=0
Status: Bus=2 Dev=4 Func=1 64bit+ 133MHz+ SCD- USC-, DC=simple,

How does one determine the PCI-X bus speed?
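
As far as I can tell, the 133MHz+ in the PCI-X status above only means the
controller is 133 MHz capable, not what the bus is actually clocked at. My
understanding of the PCI-X spec is that the actual mode and clock are latched
in the Secondary Status register of the PCI-X bridge the controller sits
behind (one of the AMD 8131 bridges; lspci -t shows which one leads to bus 02),
and a newer pciutils should decode this in the bridge's lspci -vv output.
Below is a small, untested sketch that reads the bridge's config space (run as
root, with the /proc/bus/pci path of the bridge given as argument) and decodes
that field; the offsets follow my reading of the PCI-X capability layout, so
treat it with some caution:

#include <stdio.h>
#include <stdlib.h>

#define PCI_CAPABILITY_LIST 0x34   /* offset of first capability pointer */
#define PCI_CAP_ID_PCIX     0x07   /* PCI-X capability ID */

int
main(int argc, char *argv[])
{
   unsigned char     cfg[256];
   FILE              *fp;
   int               pos;
   static const char *freq[] =
                     {
                        "conventional PCI", "PCI-X 66MHz",
                        "PCI-X 100MHz", "PCI-X 133MHz"
                     };

   if (argc != 2)
   {
      (void)fprintf(stderr, "Usage: %s /proc/bus/pci/<bus>/<dev>.<fn>\n",
                    argv[0]);
      exit(1);
   }
   if (((fp = fopen(argv[1], "rb")) == NULL) ||
       (fread(cfg, 1, sizeof(cfg), fp) != sizeof(cfg)))
   {
      perror("reading config space");
      exit(1);
   }
   (void)fclose(fp);

   /* Walk the capability list looking for the PCI-X capability. */
   for (pos = cfg[PCI_CAPABILITY_LIST]; pos != 0; pos = cfg[pos + 1])
      if (cfg[pos] == PCI_CAP_ID_PCIX)
      {
         /* For a bridge the Secondary Status word sits at offset 2 of the
          * capability; the secondary bus mode and frequency field starts
          * at bit 6 (0 = conventional, 1 = 66, 2 = 100, 3 = 133 MHz). */
         int secstat = cfg[pos + 2] | (cfg[pos + 3] << 8);
         int code = (secstat >> 6) & 0xf;

         if (code < 4)
            (void)printf("%s\n", freq[code]);
         else
            (void)printf("unknown/reserved frequency encoding %d\n", code);
         exit(0);
      }
   (void)printf("no PCI-X capability found\n");
   exit(0);
}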

Anyway, I thought this system would theoretically reach 640 MB/s (2 x 320 MB/s)
using both channels. I tested several software RAID setups to get the best
possible write speeds for this system, but testing shows that the absolute
maximum I can reach with software RAID is only approx. 270 MB/s for writing,
which is very disappointing.

The tests were done with the 2.6.12.5 kernel from kernel.org, the I/O scheduler
is deadline, and the distribution is Fedora Core 4 x86_64 with all updates. The
chunk size is always the mdadm default (64k). The filesystem was always created
with the command mke2fs -j -b4096 -O dir_index /dev/mdx.

I also tried 2.6.13-rc7, but there the speed was much lower; the maximum was
approx. 140 MB/s for writing.

Here are some tests I did and the results from bonnie++:

Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
Raid0 (8 disk)15744M 54406 96 247419 90 100752 25 60266 98 226651 29 830.2 1
Raid0s(4 disk)15744M 54915 97 253642 89 73976 18 59445 97 198372 24 659.8 1
Raid0s(4 disk)15744M 54866 97 268361 95 72852 17 59165 97 187183 22 666.3 1
Raid0p(4 disk)15744M 54017 96 149897 57 60202 15 59048 96 156887 20 381.8 1
Raid0p(4 disk)15744M 54771 98 156129 59 54130 14 58941 97 157543 20 520.3 1
Raid1+0 15744M 52496 94 202497 77 55928 14 60150 98 270509 34 930.2 1
Raid0+1 15744M 53927 95 194492 66 53430 15 49590 83 174313 30 884.7 1
Raid5 (8 disk)15744M 55881 98 153735 51 61680 24 56229 95 207348 44 741.2 1
Raid5s(4 disk)15744M 55238 98 81023 28 36859 14 56358 95 193030 38 605.7 1
Raid5s(4 disk)15744M 54920 97 83680 29 36551 14 56917 95 185345 35 599.8 1
Raid5p(4 disk)15744M 53681 95 54517 20 44932 17 54808 93 172216 33 371.1 1
Raid5p(4 disk)15744M 53856 96 55901 21 34737 13 55810 94 181825 36 607.7 1
/dev/sdc 15744M 53861 95 102270 35 25718 6 37273 60 76275 8 377.0 0
/dev/sdd 15744M 53575 95 96846 36 26209 6 37248 60 76197 9 378.4 0
/dev/sde 15744M 54398 94 87937 28 25540 6 36476 59 76520 8 380.4 0
/dev/sdf 15744M 53982 95 109192 38 26136 6 38516 63 76277 9 383.0 0
/dev/sdg 15744M 53880 95 102625 36 26458 6 37926 61 76538 9 399.1 0
/dev/sdh 15744M 53326 95 106447 39 26570 6 38129 62 76427 9 384.3 0
/dev/sdi 15744M 53103 94 96976 33 25632 6 36748 59 76658 8 386.4 0
/dev/sdj 15744M 53840 95 105521 39 26251 6 37146 60 76097 9 384.8 0

Raid1+0        - Four raid1's, where the two disks of each raid1 are on
                 different channels. The setup was done as follows:
                    Raid1 /dev/md3 (sdc + sdg)
                    Raid1 /dev/md4 (sdd + sdh)
                    Raid1 /dev/md5 (sde + sdi)
                    Raid1 /dev/md6 (sdf + sdj)
                    Raid0 /dev/md7 (md3 + md4 + md5 + md6)
Raid0+1        - Raid1 over two raid0's, each having four disks:
                    Raid0 /dev/md3 (sdc + sdd + sde + sdf)
                    Raid0 /dev/md4 (sdg + sdh + sdi + sdj)
                    Raid1 /dev/md5 (md3 + md4)
Raid0s(4 disk) - Consists of Raid0 /dev/md3 (sdc + sdd + sde + sdf) or
                 Raid0 /dev/md4 (sdg + sdh + sdi + sdj); the tests were done
                 separately, once for md3 and then for md4.
Raid0p(4 disk) - Same as Raid0s(4 disk), only the tests for md3 and md4 were
                 done at the same time (in parallel).
Raid5s(4 disk) - Same as Raid0s(4 disk), only with Raid5.
Raid5p(4 disk) - Same as Raid0p(4 disk), only with Raid5.

Additional tests were done with a little C program (attached to this mail)
that I wrote a long time ago. It measures the time it takes to write a file
of the given size; the first result is without fsync() and the second one
includes the final fsync(). It is called with two parameters: the first is the
file size in kilobytes and the second is the blocksize in bytes. The program
was always started as follows (16121856 KB = 15744 MB, the same size bonnie++
used, i.e. about twice the 8 GB of RAM, to keep caching effects small):

fw 16121856 4096

I chose 4096 as the blocksize since this is the value suggested by stat() in
st_blksize. With larger values the transfer rate increases.
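
For completeness, the 4096 suggestion comes from something like this trivial
sketch, which just prints what stat() reports in st_blksize for a file or
directory on the array:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int
main(int argc, char *argv[])
{
   struct stat stat_buf;

   if ((argc != 2) || (stat(argv[1], &stat_buf) == -1))
   {
      (void)fprintf(stderr, "Usage: %s <file or directory>\n", argv[0]);
      exit(1);
   }
   (void)printf("st_blksize = %ld\n", (long)stat_buf.st_blksize);
   exit(0);
}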

Here are the results in MB/s:
Raid0 (8 disk) 203.017 191.649
Raid0s(4 disk) 200.331 166.129
Raid0s(4 disk) 198.013 165.465
Raid0p(4 disk) 143.781 118.832
Raid0p(4 disk) 146.592 117.703
Raid0+1 206.046 118.670
Raid5 (8 disk) 181.382 115.037
/dev/sdc 94.439 56.928
/dev/sdd 89.838 55.711
/dev/sde 84.391 51.545
/dev/sdf 87.549 57.368
/dev/sdg 92.847 57.799
/dev/sdh 94.615 58.678
/dev/sdi 89.030 54.945
/dev/sdj 91.344 56.899

Why do I only get 247 MB/s for writing and 227 MB/s for reading (from the
bonnie++ results) for a Raid0 over 8 disks? Going by the numbers of the
individual disks, I was expecting nearly three times those figures.

What limit am I hitting here?

Thanks,
Holger
--
/*****************************************************************************/
/* File Write Performance */
/* ====================== */
/*****************************************************************************/

#include <stdio.h> /* printf() */
#include <string.h> /* strcmp() */
#include <stdlib.h> /* exit(), atoi(), calloc(), free() */
#include <unistd.h> /* write(), sysconf(), close(), fsync() */
#include <sys/times.h> /* times(), struct tms */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <stdarg.h>

#define MAXLINE 4096
#define BUFSIZE 512
#define DEFAULT_FILE_SIZE 31457280
#define TEST_FILE "test.file"
#define FILE_MODE (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH)


static void err_doit(int, char *, va_list),
            err_quit(char *, ...),
            err_sys(char *, ...);


/*############################### main() ####################################*/
int
main(int argc, char *argv[])
{
   register int n,
                loops,
                rest;
   int          fd,
                oflag,
                blocksize = BUFSIZE;
   off_t        filesize = DEFAULT_FILE_SIZE;
   clock_t      start,
                end,
                syncend;
   long         clktck;
   char         *buf;
   struct tms   tmsdummy;

   if ((argc > 1) && (argc < 5))
   {
      filesize = (off_t)atoi(argv[1]) * 1024;
      if (argc == 3)
         blocksize = atoi(argv[2]);
      else if (argc == 4)
         err_quit("Usage: %s [filesize] [blocksize]", argv[0]);
   }
   else if (argc != 1)
      err_quit("Usage: %s [filesize] [blocksize]", argv[0]);

   if ((clktck = sysconf(_SC_CLK_TCK)) < 0)
      err_sys("sysconf error");

   /* If clktck = 0 it doesn't make sense to run the test. */
   if (clktck == 0)
   {
      (void)printf("0\n");
      exit(0);
   }

   if ((buf = calloc(blocksize, sizeof(char))) == NULL)
      err_sys("calloc error");

   /* Fill the buffer so something other than zeros is written. */
   for (n = 0; n < blocksize; n++)
      buf[n] = 'T';

   loops = filesize / blocksize;
   rest = filesize % blocksize;

   oflag = O_WRONLY | O_CREAT;

   if ((fd = open(TEST_FILE, oflag, FILE_MODE)) < 0)
      err_quit("Could not open %s", TEST_FILE);

   if ((start = times(&tmsdummy)) == -1)
      err_sys("Could not get start time");

   /* Timed region: write the file in blocksize chunks, the remainder last. */
   for (n = 0; n < loops; n++)
      if (write(fd, buf, blocksize) != blocksize)
         err_sys("write error");
   if (rest > 0)
      if (write(fd, buf, rest) != rest)
         err_sys("write error");

   if ((end = times(&tmsdummy)) == -1)
      err_sys("Could not get end time");

   /* The second measurement also includes flushing dirty pages to disk. */
   (void)fsync(fd);

   if ((syncend = times(&tmsdummy)) == -1)
      err_sys("Could not get end time");

   (void)close(fd);
   free(buf);

   /* Report throughput in MB/s: bytes written / elapsed seconds / 2^20. */
   (void)printf("%f %f\n",
                ((double)filesize / ((double)(end - start) / (double)clktck)) / 1048576.0,
                ((double)filesize / ((double)(syncend - start) / (double)clktck)) / 1048576.0);

   exit(0);
}


static void
err_sys(char *fmt, ...)
{
   va_list ap;

   va_start(ap, fmt);
   err_doit(1, fmt, ap);
   va_end(ap);
   exit(1);
}


static void
err_quit(char *fmt, ...)
{
   va_list ap;

   va_start(ap, fmt);
   err_doit(0, fmt, ap);
   va_end(ap);
   exit(1);
}


/* Print the message (plus strerror(errno) when errnoflag is set) to stderr. */
static void
err_doit(int errnoflag, char *fmt, va_list ap)
{
   int  errno_save;
   char buf[MAXLINE];

   errno_save = errno;
   (void)vsnprintf(buf, MAXLINE - 2, fmt, ap);
   if (errnoflag)
      (void)snprintf(buf + strlen(buf), MAXLINE - strlen(buf) - 2, ": %s",
                     strerror(errno_save));
   (void)strcat(buf, "\n");
   fflush(stdout);
   (void)fputs(buf, stderr);
   fflush(NULL); /* Flushes all stdio output streams. */
   return;
}