We have to choose the final disk solution. We are currently
experimenting with a striping setup of 2 SCSI controllers and 4 disks
(4 GB each). A file system on the disks is not vital; we can live
even with raw partitions. The focus is on raw I/O performance.
My minimum target for sustained PCI<->disk performance is 30 MB/s.
Good PCI-board-to-memory performance is necessary anyway, so even
though disk I/O IS the real bottleneck, I want to optimize both the
PCI<->memory and memory<->disk data paths.
--- the main questions ---
[ in the following, these are the products of my damaged brain ]
A) Is there an I/O schema with (possibly near) zero intermediate
buffer copies between a DMA-able PCI device and the disks?
B) What explains the strange I/O measurements on striped disks, using
both 2.0.33 and 2.1.9X kernels? Am I missing something?
Writing to the striped (RAID0) partition, I measured the same
throughput as with a single disk (writing 256 MB files).
The controller and disk lights blink at a lower frequency than with
single-disk access, where they are essentially always on (this is
somewhat strange!!).
Please see below for my HW configuration and test benchmarks.
C) How can I avoid buffer caching, and the consequent swapping,
during >200 MB I/O operations?
For our ~1 GB I/O operations, caching is of no use.
--- more tech details ---
We have already developed some mixed kernel/user mode drivers to
control our PCI boards, using both PIO and DMA transfer modes. PIO
read performance is pitiful, so we mainly focus on DMA.
In the following, I mention some (really confused) brainstorming on
the subject.
1) Using the simple I/O schema (e.g. for download):
- read from the PCI board into a user buffer
- write to disk
The kernel driver essentially provides DMA-able memory chunks of the
size supported by our DMA logic.
This schema inevitably incurs the multiple-copy overhead (but you
know this situation better than me, of course):
- data is DMA-ed into the DMA-able buffer memory, by manipulating
our board registers from user mode,
- appended to the user buffer (all in user mode),
- the user buffer is passed to the C library write() (a user->kernel
mode switch),
- the data is cached,
- and finally written to the (possibly MD-driver soft-striped) disks
(so the buffer is copied into private SCSI DMA-able buffers).
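For what it's worth, the staging loop of this simple schema can be
sketched in plain C; here a regular file descriptor stands in for the
PCI device, and the chunk size is a made-up placeholder for whatever
the DMA logic supports:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

#define CHUNK (64 * 1024)  /* placeholder for the DMA-able chunk size */

/* Stage data from 'in_fd' (stand-in for the PCI device) through a
 * user buffer to 'out_fd' (the disk).  Besides this explicit copy,
 * each chunk is copied again inside the kernel: into the buffer
 * cache, and then into the SCSI driver's DMA-able buffers.
 * Returns total bytes written, or -1 on error. */
long stage_copy(int in_fd, int out_fd)
{
    char buf[CHUNK];
    long total = 0;
    ssize_t n;

    while ((n = read(in_fd, buf, sizeof buf)) > 0) {
        if (write(out_fd, buf, (size_t)n) != n)
            return -1;
        total += n;
    }
    return n < 0 ? -1 : total;
}
```

The point of the sketch is only to make the copy passes visible: one
read() copy, one write() copy, plus the kernel-internal ones.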
* A little variation (on the simple schema), to minimize user/kernel
mode switches: use truncated/mmap-ped files.
This works extremely well for small files, but it seems to get worse
with bigger files as swapping (perhaps that is not the right word)
kicks in.
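The truncate+mmap variant looks roughly like the following sketch
(on a regular file; the path and sizes are made up, and msync() here
stands in for whatever flush policy one actually wants):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

/* Write 'len' bytes from 'src' into 'path' through a mapping instead
 * of repeated write(2) calls: one setup switch into the kernel, then
 * plain memory stores; msync() flushes the dirty pages at the end.
 * Returns 0 on success, -1 on error. */
int mmap_write(const char *path, const char *src, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {    /* grow file to final size */
        close(fd);
        return -1;
    }
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(map, src, len);                  /* the "append" step, all user mode */
    int rc = msync(map, len, MS_SYNC);      /* force dirty pages to disk */
    munmap(map, len);
    close(fd);
    return rc;
}
```

With large files the dirty mapped pages compete for the same memory
as everything else, which is presumably the degradation described
above.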
* Another variation (on the simple schema):
DMA directly to/from the mmap-ped file region (or to the user buffer
and then do the write...).
Here I really need some way to lock the mmap-ped pages and to get
their 'real' (physical) addresses to pass to the PCI device. Moreover,
I'm limited by the 4096-byte page size as the maximum DMA transaction
size, unless I tweak the mmap code to ensure physically contiguous
pages. Perhaps I'd even have to abandon our mixed kernel/user mode
driver architecture.
[ I followed the recent discussion threads about DMA etc., and Linus
was suggesting that a dma_mmap kernel function could be useful;
perhaps this solution goes in the same direction. ]
This method avoids one copy pass, but it does not cure the buffering
and caching problem I seem to have.
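The page-locking half of this can at least be sketched from user
space with mlock(2); note that this only pins the pages, and the
virtual-to-bus address translation still needs kernel help (in a
mixed kernel/user driver, e.g. some custom ioctl, which is purely
hypothetical here):

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Pin 'len' bytes at 'addr' in RAM so they cannot be paged out while
 * a device DMAs into them.  This does NOT yield the physical address;
 * for that a kernel-side translation is still needed (hypothetically,
 * an ioctl in the board driver).  Returns 0 on success. */
int pin_buffer(void *addr, size_t len)
{
    return mlock(addr, len);    /* kernel rounds addr down to a page boundary */
}

int unpin_buffer(void *addr, size_t len)
{
    return munlock(addr, len);
}
```

Locked memory is limited per process (RLIMIT_MEMLOCK), so pinning
whole multi-hundred-megabyte transfer regions this way is unlikely to
fly; pinning a small ring of DMA chunks might.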
2) In the direction of minimizing buffer cache/swapping overhead,
does it make sense to write a special kernel driver which skips the
buffer cache and goes directly to the block devices?
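Short of a special driver, one blunt userspace way to keep dirty
pages from piling up is synchronous writes (O_SYNC) straight to the
target device node; a sketch on a regular file (it bounds caching at
the price of throughput, so it probably costs too much here, but it
isolates the caching variable in benchmarks):

```c
#include <fcntl.h>
#include <unistd.h>

/* Write 'len' bytes to 'path' with O_SYNC: each write(2) returns
 * only after the data has been pushed toward the device, so dirty
 * buffers never accumulate in the cache.
 * Returns bytes written, or -1 on error. */
ssize_t sync_write(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);
    close(fd);
    return n;
}
```

Measuring with and without O_SYNC would also show how much of the
single-run dd figures is really cache, not disk.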
3) The most ambitious task I can think of: work out a complex system
in which:
- we prepare DMA buffers of exactly the size expected by the SCSI
layer (of course, leaving space at top and bottom for the SCSI
header/footer);
- we DMA between our boards and these buffers;
- we channel these buffers directly to the low-level SCSI layer.
Of course, I'd have a lot to think about there, such as providing
the proper parameters to the SCSI layer while trying to skip the
upper layers; or I could use upper-level code to prepare the
operation and then... (perhaps it's all garbage)
4) [HW only] buy a machine with >=1 GB of RAM: costly, of course, and
perhaps even impossible to get from our HW supplier.
5) [HW only] integrate a SCSI controller directly on our PCI boards:
impossible from our technical point of view at this stage of the
project.
NOTE: we use a Compact PCI set-up with:
- a 6U CPU card with:
Pentium 166MMX, 64 MB RAM, HX PCI chipset, 1 main PCI bus
with the usual standard devices (USB, IDE, ...) plus 2 DEC
PCI-PCI bridges
- a double PCI bus backplane, into which the CPU board plugs
- on each PCI bus, at least one Adaptec AIC-7870-based SCSI
host adapter supporting fast-wide (20 MB/s?) connections to
2 fast-wide SEAGATE-ST34572W 4.3 GB HDs
- each disk is partitioned into four partitions
As regards striping, the figures I'm obtaining:
- each disk seems capable of ~10 MB/s throughput
- for the striped device, with both ext2 and raw (/dev/md?)
accesses, I would reasonably expect something between 25-30 MB/s
and the 40 MB/s peak value
performance test results are available at:
http://apemaia.roma1.infn.it/~rossetti/testraid3.txt
For timing tests, I used a (surely much too simple) script; note that
the final sync must be timed too, or data still sitting in the buffer
cache makes the figures optimistic:
#!/bin/bash
set -u
DISKS="/dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1"
BLKSIZE="64k 32k 16k 8k 4k"
COUNT="4000 2000 1000"
for dsk in $DISKS; do
    for blksize in $BLKSIZE; do
        for count in $COUNT; do
            sync
            echo "disk=$dsk blksize=$blksize numblks=$count"
            # include the trailing sync in the timed region
            time sh -c "dd if=/dev/zero of=$dsk bs=$blksize count=$count; sync"
            echo ""
        done
    done
done
I know that /dev/zero is not the best source; I also tried /dev/mem,
with better results. Anyway, these figures serve only to give an
idea.
I also tried writing a little program which uses mmap; results vary
but are never satisfactorily good, always around 8-12 MB/s for small
file sizes.
my striping setup is:
/dev/md0 raid0,4k,0,f2704f30 /dev/sdb1 /dev/sde1 /dev/sdd1 /dev/sdf1
/dev/md1 raid0,4k,0,3245dbc9 /dev/sdb2 /dev/sde2 /dev/sdd2 /dev/sdf2
/dev/md2 raid0,4k,0,79af7be9 /dev/sde3 /dev/sdf3
/dev/md3 raid0,4k,0,a6ab8fe6 /dev/sdb3 /dev/sdd3
sdb & sdd are on PCI bus 1/SCSI bus 1, sde & sdf on PCI bus 2/SCSI bus 2
I also tried larger raid0 chunk sizes; 32k seems the best, within a
few percent.
thanks in advance.
best regards.
+------------------------------------------------------------------+
|Rossetti Davide INFN - Sezione Roma I - gruppo V, prog. APEmille|
| web : http://apemaia.roma1.infn.it/~rossetti |
| """"" E-mail : davide.rossetti@roma1.infn.it |
| |o o| phone : (+39)-6-49914412 |
|--o00O-O00o-- fax : (+39)-6-4957697 |
| address: Dipartimento di Fisica (V.E.) |
| Universita' di Roma "La Sapienza" |
| P.le Aldo Moro,5 I - 00185 Roma - Italy|
| pgp pub. key: finger rossetti@apemaia.roma1.infn.it |
| |
|"Most people think about twice a year. I got famous by thinking |
| once a week." - George B. Shaw (see Appendix A) |
+------------------------------------------------------------------+
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu