Another Linux performance anomaly with 1+ GB files

Scott Laird (laird@pacificrim.net)
Wed, 12 Nov 1997 01:29:35 -0800


We have a P6/200 system with 128 MB of RAM (only 64 MB is in use at
present; I'll fix lilo as soon as I can reboot) that we're using as a
Solid (http://www.solidtech.com/) database server. We've had problems
with weird system stalls in the past, all of which were caused by
fsync(2) taking over 100 seconds to run on large (> 1 GB) files.
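
For what it's worth, the fsync stalls are easy to time with a
throwaway program along these lines (just a sketch; the path and the
1 GB figure are placeholders, not our actual database files):

/* fsync-timer.c -- rough sketch for timing fsync(2) after dirtying a
 * large file.  The path and the 1 GB size are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/tmp/bigfile";
    char buf[8192];
    struct timeval t0, t1;
    long i, blocks = 128 * 1024;        /* 128k blocks * 8k = 1 GB */
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 0xaa, sizeof(buf));

    for (i = 0; i < blocks; i++)        /* dirty roughly 1 GB of buffers */
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write");
            return 1;
        }

    gettimeofday(&t0, NULL);
    if (fsync(fd) < 0)                  /* this is the call that stalls */
        perror("fsync");
    gettimeofday(&t1, NULL);

    printf("fsync took %.3f seconds\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
    close(fd);
    return 0;
}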

Now we seem to have a new problem, and I'm not sure whether it's
hardware or software. It appears that read(2) system calls
occasionally take a *long* time (10 minutes to an hour) to return. No
SCSI errors (or any other kernel errors) have been logged. I've seen
this happen three times in the last 90 minutes.

We just added another 4 GB drive to the system and created an ext2
filesystem with 4k blocks. It is the third drive in the system; all
three are on a single Adaptec 2940. The other two drives also use
ext2, but with 1k blocks. The system is running 2.0.31pre10.

According to strace, the solid database process hung in the middle of
this sequence of reads (strace -rT; the first number is the time since
the previous syscall, the number in angle brackets is the time spent
in the syscall):

0.122090 lseek(5, 51707904, SEEK_SET) = 51707904 <0.000131>
0.000465 read(5, "\v}\10\0\0\5\35\340\0\0\0"..., 8192) = 8192 <0.051580>
0.052510 lseek(5, 57950208, SEEK_SET) = 57950208 <0.000131>
0.000468 read(5, "\vo\10\0\0\5\35\340\0\0\0"..., 8192) = 8192 <0.048295>
0.049060 lseek(5, 74989568, SEEK_SET) = 74989568 <0.000141>
0.000476 read(5, "\v\224\10\0\0\4\35\340\0\0"..., 8192) = 8192 <0.040003>
0.047869 lseek(5, 80928768, SEEK_SET) = 80928768 <0.000187>
0.000556 read(5, "\v\245\10\0\0\3\35\340\0"..., 8192) = 8192 <3176.731624>
3176.732664 lseek(5, 89088000, SEEK_SET) = 89088000 <0.000124>

The last call took almost an hour to return. The preceding hang
(about an hour earlier) recovered and continued within 10 minutes.
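
For anyone who wants to poke at the read side without a copy of
Solid, the access pattern is basically random 8k lseek/read over a
large file. A rough sketch (the path and the 1-second threshold are
made up for illustration, not taken from our setup):

/* random-read-timer.c -- sketch of the lseek/read pattern the database
 * uses: random 8k reads, logging any read that takes longer than a
 * threshold.  The path and the 1-second threshold are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/newdisk/testfile";
    char buf[8192];
    struct stat st;
    off_t blocks, off;
    double t0, t1;
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0) { perror(path); return 1; }
    blocks = st.st_size / sizeof(buf);
    if (blocks < 1) { fprintf(stderr, "file too small\n"); return 1; }

    for (;;) {                          /* run until interrupted */
        off = (off_t)(rand() % blocks) * (off_t)sizeof(buf);
        if (lseek(fd, off, SEEK_SET) < 0) { perror("lseek"); return 1; }

        t0 = now();
        if (read(fd, buf, sizeof(buf)) < 0) { perror("read"); return 1; }
        t1 = now();

        if (t1 - t0 > 1.0)              /* log anything slower than 1 s */
            printf("slow read: offset %ld took %.1f seconds\n",
                   (long)off, t1 - t0);
    }
}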

According to ps, the solid process was blocked in WCHAN 1261d9:

 FLAGS UID PID PPID PRI NI   VSZ   RSS  WCHAN STAT TT  TIME COMMA
100130   0 140  861   0  0 12240 12096 1261d9 D    ?   8:27 /var/soli

According to System.map, that's somewhere in the middle of
try_to_free_buffer, if I'm reading ps right:

00125fb8 T generic_readpage
00126038 t grow_buffers
> 00126120 T try_to_free_buffer
0012635c T show_buffers
00126434 T buffer_init
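
In case I'm mapping the address wrong, here's the sort of lookup I'm
doing, as a throwaway program (the default System.map path and the
mask that strips the 0xc0000000 kernel offset are assumptions about
this particular setup):

/* wchan-lookup.c -- map a WCHAN address from ps to the nearest
 * preceding symbol in System.map.  The default map path and the mask
 * that strips the 0xc0000000 kernel offset are assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *map = (argc > 2) ? argv[2] : "/boot/System.map";
    unsigned long wchan, addr, best_addr = 0;
    char type, name[256], best_name[256] = "?";
    char line[512];
    FILE *f;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <wchan-hex> [System.map]\n", argv[0]);
        return 1;
    }
    wchan = strtoul(argv[1], NULL, 16) & 0x0fffffffUL;

    if ((f = fopen(map, "r")) == NULL) { perror(map); return 1; }

    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "%lx %c %255s", &addr, &type, name) != 3)
            continue;
        addr &= 0x0fffffffUL;
        /* keep the highest symbol address that is still <= WCHAN */
        if (addr <= wchan && addr >= best_addr) {
            best_addr = addr;
            strcpy(best_name, name);
        }
    }
    fclose(f);

    printf("%s falls in %s (symbol starts at %06lx, offset 0x%lx)\n",
           argv[1], best_name, best_addr, wchan - best_addr);
    return 0;
}

Fed 1261d9 and the map excerpt above, that lands in
try_to_free_buffer at offset 0xb9.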

Here's /proc/meminfo, for good measure:

MemTotal:     63608 kB
MemFree:       2168 kB
MemShared:    21904 kB
Buffers:       2084 kB
Cached:       31368 kB
SwapTotal:   389980 kB
SwapFree:    389976 kB

All /proc tuning parameters are at their default values. I haven't
tweaked anything yet.

The system itself is still responsive, and all three
drives/filesystems are accessible, as far as I can determine. There's
no indication of hardware failure. All three SCSI drives are
internal, the bus is properly terminated, and the termination hasn't
changed with the addition of the new drive (that is, it was added to
the chain between the terminator and the controller). There are no
external SCSI devices in the system.

The hanging solid server process has 18 open file descriptors,
according to /proc. Most of them are network sockets, a few are log
files, and numbers 4, 5, and 6 are database files. fd 4 is on
/dev/sda2 (an old 1k filesystem) and is right at the 2 GB file size
limit; fd 5 is on the new sdc2 filesystem and is over 1 GB; fd 6 is
also on the new filesystem and is 0 bytes long at the moment.
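
That fd list came from readlink()ing the entries in /proc/<pid>/fd
and stat()ing through them for sizes; something like this does the
same check (just a sketch, nothing in it is specific to our box, and
if I remember right the 2.0 links read back as device:inode pairs
rather than real paths):

/* list-fds.c -- sketch: walk /proc/<pid>/fd, show what each
 * descriptor resolves to and how big the file behind it is. */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char dir[64], path[512], target[1024];
    struct dirent *de;
    struct stat st;
    DIR *d;
    int len;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(dir, sizeof(dir), "/proc/%s/fd", argv[1]);
    if ((d = opendir(dir)) == NULL) { perror(dir); return 1; }

    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] == '.')       /* skip "." and ".." */
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        len = readlink(path, target, sizeof(target) - 1);
        if (len < 0)
            continue;
        target[len] = '\0';
        /* stat() through the /proc link to get the target's size */
        if (stat(path, &st) == 0 && S_ISREG(st.st_mode))
            printf("fd %-3s -> %s  (%ld bytes)\n",
                   de->d_name, target, (long)st.st_size);
        else
            printf("fd %-3s -> %s\n", de->d_name, target);
    }
    closedir(d);
    return 0;
}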

It appears that there's some sort of problem with large files and
multiple filesystems with varying block sizes in Linux 2.0.x. Does
anyone have any suggested fixes? Is 2.1.x any better with large
files? Has anyone fixed fsync(2) in 2.1.x?

It's hard to see Linux as a viable platform for commercial database
servers when we have these sorts of performance problems (over 50
minutes for an 8k read!?) with large files. Can someone with a better
grip on the buffer cache please take a look?

While I was writing this, the process hung again, in read(2) on fd 5
at offset 947511296. This one is also blocked at WCHAN 1261d9.

Scott Laird
Pacific Rim Network, Inc.