We have to choose the final disk solution. We are currently
experimenting with a striping setup of 2 SCSI controllers and 4 disks
(4 GB each). A file system on the disks is not vital; we can live
even with raw partitions. The focus is on raw I/O performance.
My minimum target for sustained PCI<->disk performance is 30 MB/s.
Good PCI-board-to-memory performance is necessary anyway, so even
though disk I/O IS the real bottleneck, I want to optimize both the
PCI<->memory and memory<->disk data paths.
--- the main questions ---
[ in the following, these are the products of my damaged brain ]
A) Is there an I/O schema with (possibly near) zero intermediate
buffer copies between a DMA-able PCI device and the disks?
B) What explains the strange I/O measurements on striped disks, using
both 2.0.33 and 2.1.9X kernels? Am I missing something?
Writing to the striped (RAID0) partition, I measured the same
throughput as with a single disk (writing 256 MB files).
The controller and disk lights blink at a lower frequency than with
single-disk access, where they are essentially always on (this is
somewhat strange!!).
Please see below for my HW configuration and test benchmarks.
C) How can I avoid buffer caching, and the consequent swapping,
during >200 MB I/O operations?
For our ~1 GB I/O operations, caching is of no use.
--- more tech details ---
We have already developed some mixed kernel/user mode drivers to
control our PCI boards, using both PIO and DMA transfer modes. PIO
read performance is pitiful, so we mainly focus on DMA.
In the following, I mention some (really confused) brainstorming on
the subject.
1) Using the simple I/O schema (e.g. for download):
- read from the PCI board into a user buffer
- write to disk
The kernel driver essentially provides DMA-able memory chunks of the
size supported by our DMA logic.
This schema inevitably incurs the multiple-copy overhead (but you
know this situation better than me, of course):
- data is DMA-ed into the DMA-able buffer memory, by manipulating
our board registers from user mode,
- appended to the user buffer (all in user mode),
- the user buffer is passed to the C library write() (a user->kernel
mode switch),
- the data is cached,
- and finally written to the (possibly MD-driver soft-striped) disks
(so the buffer is copied into private SCSI DMA-able buffers).
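For what it's worth, the staging loop of this simple schema can be
sketched in plain C; here a regular file descriptor stands in for the
PCI device, and the chunk size is a made-up placeholder for whatever
the DMA logic supports:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

#define CHUNK (64 * 1024)  /* placeholder for the DMA-able chunk size */

/* Stage data from 'in_fd' (stand-in for the PCI device) through a
 * user buffer to 'out_fd' (the disk).  Besides this explicit copy,
 * each chunk is copied again inside the kernel: into the buffer
 * cache, and then into the SCSI driver's DMA-able buffers.
 * Returns total bytes written, or -1 on error. */
long stage_copy(int in_fd, int out_fd)
{
    char buf[CHUNK];
    long total = 0;
    ssize_t n;

    while ((n = read(in_fd, buf, sizeof buf)) > 0) {
        if (write(out_fd, buf, (size_t)n) != n)
            return -1;
        total += n;
    }
    return n < 0 ? -1 : total;
}
```

The point of the sketch is only to make the copy passes visible: one
read() copy, one write() copy, plus the kernel-internal ones.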
* A little variation (on the simple schema), to minimize user/kernel
mode switches: use truncated/mmap-ped files.
This works extremely well for small files, but it seems to get worse
with bigger files as swapping (perhaps that is not the right word)
kicks in.
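The truncate+mmap variant looks roughly like the following sketch
(on a regular file; the path and sizes are made up, and msync() here
stands in for whatever flush policy one actually wants):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

/* Write 'len' bytes from 'src' into 'path' through a mapping instead
 * of repeated write(2) calls: one setup switch into the kernel, then
 * plain memory stores; msync() flushes the dirty pages at the end.
 * Returns 0 on success, -1 on error. */
int mmap_write(const char *path, const char *src, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {    /* grow file to final size */
        close(fd);
        return -1;
    }
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(map, src, len);                  /* the "append" step, all user mode */
    int rc = msync(map, len, MS_SYNC);      /* force dirty pages to disk */
    munmap(map, len);
    close(fd);
    return rc;
}
```

With large files the dirty mapped pages compete for the same memory
as everything else, which is presumably the degradation described
above.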
* Another variation (on the simple schema):
DMA directly to/from the mmap-ped file region (or to the user buffer
and then do the write...).
Here I really need some way to lock the mmap-ped pages and to get
their 'real' (physical) addresses to pass to the PCI device. Moreover,
I'm limited by the 4096-byte page size as the maximum DMA transaction
size, unless I tweak the mmap code to ensure physically contiguous
pages. Perhaps I'd even have to abandon our mixed kernel/user mode
driver architecture.
[ I followed the recent discussion threads about DMA etc., and Linus
was suggesting that a dma_mmap kernel function could be useful;
perhaps this solution goes in the same direction. ]
This method avoids one copy pass, but it does not cure the buffering
and caching problem I seem to have.
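The page-locking half of this can at least be sketched from user
space with mlock(2); note that this only pins the pages, and the
virtual-to-bus address translation still needs kernel help (in a
mixed kernel/user driver, e.g. some custom ioctl, which is purely
hypothetical here):

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Pin 'len' bytes at 'addr' in RAM so they cannot be paged out while
 * a device DMAs into them.  This does NOT yield the physical address;
 * for that a kernel-side translation is still needed (hypothetically,
 * an ioctl in the board driver).  Returns 0 on success. */
int pin_buffer(void *addr, size_t len)
{
    return mlock(addr, len);    /* kernel rounds addr down to a page boundary */
}

int unpin_buffer(void *addr, size_t len)
{
    return munlock(addr, len);
}
```

Locked memory is limited per process (RLIMIT_MEMLOCK), so pinning
whole multi-hundred-megabyte transfer regions this way is unlikely to
fly; pinning a small ring of DMA chunks might.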
2) In the direction of minimizing buffer cache/swapping overhead,
does it make sense to write a special kernel driver which skips the
buffer cache and goes directly to the block devices?
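Short of a special driver, one blunt userspace way to keep dirty
pages from piling up is synchronous writes (O_SYNC) straight to the
target device node; a sketch on a regular file (it bounds caching at
the price of throughput, so it probably costs too much here, but it
isolates the caching variable in benchmarks):

```c
#include <fcntl.h>
#include <unistd.h>

/* Write 'len' bytes to 'path' with O_SYNC: each write(2) returns
 * only after the data has been pushed toward the device, so dirty
 * buffers never accumulate in the cache.
 * Returns bytes written, or -1 on error. */
ssize_t sync_write(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);
    close(fd);
    return n;
}
```

Measuring with and without O_SYNC would also show how much of the
single-run dd figures is really cache, not disk.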
3) The most ambitious task I can think of: work out a complex system
in which:
- we prepare DMA buffers of exactly the size expected by the SCSI
layer (of course, leaving space at top and bottom for the SCSI
header/footer);
- we DMA between our boards and these buffers;
- we channel these buffers directly to the low-level SCSI layer.
Of course, I'd have a lot to think about there, such as providing
the proper parameters to the SCSI layer while trying to skip the
upper layers; or I could use upper-level code to prepare the
operation and then... (perhaps it's all garbage)
4) [HW only] buy a machine with >=1 GB of RAM: costly, of course, and
perhaps even impossible to get from our HW supplier.
5) [HW only] integrate a SCSI controller directly on our PCI boards:
impossible from our technical point of view at this stage of the
project.
NOTE: we use a Compact PCI set-up with:
- a 6U CPU card with:
Pentium 166MMX, 64 MB RAM, HX PCI chipset, 1 main PCI bus
with the usual standard devices (USB, IDE, ...) plus 2 DEC
PCI-PCI bridges
- a double PCI bus backplane, into which the CPU board plugs
- on each PCI bus, at least one Adaptec AIC-7870-based SCSI
host adapter supporting fast-wide (20 MB/s?) connections to
2 fast-wide SEAGATE-ST34572W 4.3 GB HDs
- each disk is partitioned into four partitions
As regards striping, the figures I'm obtaining:
- each disk seems capable of ~10 MB/s throughput
- for the striped device, with both ext2 and raw (/dev/md?)
accesses, I would reasonably expect something between 25-30 MB/s
and the 40 MB/s peak value
performance test results are available at:
http://apemaia.roma1.infn.it/~rossetti/testraid3.txt
For timing tests, I used a (surely much too simple) script; note that
the final sync must be timed too, or data still sitting in the buffer
cache makes the figures optimistic:
#!/bin/bash
set -u
DISKS="/dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1"
BLKSIZE="64k 32k 16k 8k 4k"
COUNT="4000 2000 1000"
for dsk in $DISKS; do
    for blksize in $BLKSIZE; do
        for count in $COUNT; do
            sync
            echo "disk=$dsk blksize=$blksize numblks=$count"
            # include the trailing sync in the timed region
            time sh -c "dd if=/dev/zero of=$dsk bs=$blksize count=$count; sync"
            echo ""
        done
    done
done
I know that /dev/zero is not the best source; I also tried /dev/mem,
with better results. Anyway, these figures serve only to give an
idea.
I also tried writing a little program which uses mmap; results vary
but are never satisfactorily good, always around 8-12 MB/s for small
file sizes.
my striping setup is:
/dev/md0 raid0,4k,0,f2704f30 /dev/sdb1 /dev/sde1 /dev/sdd1 /dev/sdf1
/dev/md1 raid0,4k,0,3245dbc9 /dev/sdb2 /dev/sde2 /dev/sdd2 /dev/sdf2
/dev/md2 raid0,4k,0,79af7be9 /dev/sde3 /dev/sdf3
/dev/md3 raid0,4k,0,a6ab8fe6 /dev/sdb3 /dev/sdd3
sdb & sdd are on PCI bus 1/SCSI bus 1, sde & sdf on PCI bus 2/SCSI bus 2
I also tried larger raid0 chunk sizes; 32k seems the best, within a
few percent.
thanks in advance.
best regards.
+------------------------------------------------------------------+
|Rossetti Davide INFN - Sezione Roma I - gruppo V, prog. APEmille|
| web : http://apemaia.roma1.infn.it/~rossetti |
| """"" E-mail : davide.rossetti@roma1.infn.it |
| |o o| phone : (+39)-6-49914412 |
|--o00O-O00o-- fax : (+39)-6-4957697 |
| address: Dipartimento di Fisica (V.E.) |
| Universita' di Roma "La Sapienza" |
| P.le Aldo Moro,5 I - 00185 Roma - Italy|
| pgp pub. key: finger rossetti@apemaia.roma1.infn.it |
| |
|"Most people think about twice a year. I got famous by thinking |
| once a week." - George B. Shaw (see Appendix A) |
+------------------------------------------------------------------+
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu