Follow up on Fileserving performance problem with recent 2.4 kernels

From: Robert Cohen (robert@coorong.anu.edu.au)
Date: Tue Jul 25 2000 - 03:35:16 EST


This is a follow-up to my recent message describing performance problems
in recent 2.4 kernels. However, I realised that many of the kernel
developers were away at OLS at the time I sent the original. So I am
including a full recap here since many of the people who hopefully would
have found my report useful might not have seen it.

I have been doing some evaluation of Linux as an AppleShare file server
using Netatalk. The benchmark I am using is a Mac program called Lantest,
but the part of Lantest I am using just writes a 30 Meg file, reads it
back, and repeats this continuously.
I vary the number of client machines I run it on simultaneously, with
each client reading and writing a separate file.
All tests are done over 100Mbit ethernet.

Over a variety of Linux kernels, I have found that above a certain
(fairly small) number of clients, there is a disastrous drop-off in
write performance.
This is fairly clearly the result of excessive seeking, which shows that
the elevator isn't working well.
The symptoms are that the disk can be heard seeking madly and the disk
light is on continuously, but the number of blocks transferred per
second according to vmstat drops by a factor of 10 (to 100-200 in a
5 second interval).

[Aside for people reading this who aren't disk performance experts:
the disk used can transfer approx 15 Megs per second for large
contiguous reads or writes, but has an average seek time of 8 ms. So in
the worst case, where instead of doing large contiguous writes it
writes 8k, seeks, and writes 8k again, the throughput drops to approx
1 Meg per second.
This is what the elevator is designed to avoid, or at least minimise.]
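
For the record, here is that worst-case figure as a trivial
back-of-the-envelope calculation (the numbers are just the ones quoted
above; the ~0.5 ms it takes to actually transfer 8k at 15 Megs per
second is ignored):

    /* Worst case: one 8 KB write per 8 ms average seek. */
    #include <stdio.h>

    int main(void)
    {
        double chunk_kb = 8.0;                  /* size of each write */
        double seek_ms  = 8.0;                  /* average seek time  */
        double seeks_per_sec = 1000.0 / seek_ms;
        double kb_per_sec    = chunk_kb * seeks_per_sec;

        printf("~%.0f KB/s, i.e. roughly 1 Meg per second\n", kb_per_sec);
        return 0;
    }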

If I put each client on a separate disk, the problem disappears, which
is consistent with the problem being seek related. This demonstrates
that it's not a problem in the appletalk stack, the benchmark program
itself, the protocols involved, the buffer cache etc.

The slightly odd thing is that performance does seem to be affected by
the fact that it's a networked test. The file serving gets handled by a
separate process for each client. I can strace these processes and see
that they are just doing an 8k read from the network followed by an 8k
write to disk. However, if I use the IOzone throughput test with
multiple processes doing 8k reads and writes of 30 Meg files, I don't
see the same sort of performance problems.
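
For what it's worth, the pattern the strace output suggests is nothing
more exotic than this (a rough sketch only, not the actual netatalk
code; error handling and short read/write handling omitted):

    #include <unistd.h>

    #define CHUNK 8192

    /* What each file-serving child appears to be doing while a client
     * writes its 30 Meg file: block on the network for 8k, write that
     * 8k to the file, go back to blocking on the network. */
    static void serve_client_write(int sock_fd, int file_fd)
    {
        char buf[CHUNK];
        ssize_t n;

        while ((n = read(sock_fd, buf, CHUNK)) > 0)
            write(file_fd, buf, n);
    }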

My current theory is that waiting for the network read is causing a
context switch to the next process, hence interleaving the writes from
all the processes. Under IOzone, each process is probably doing
multiple 8k writes in its timeslice.

Anyway, here's a summary of my findings for each kernel version. I list
the number of simultaneous clients that can be supported before the
performance starts dropping drastically. This is on a machine with 128
Megs of memory.
With the number of clients shown, throughputs are between 5 and 10 Megs
per second, often saturating the ethernet. Once we hit the limit,
individual client throughputs drop to less than 300k per second, with
aggregate throughput of less than 2 Megs per second.

2.0.39:                  1
2.2.15:                  1
2.2.17pre13:             1
2.4.0-test1:             2
2.4.0-test1-ac22-riel:   5+
2.4.0-test1-ac22-class:  5+
2.4.0-test3:             2
2.4.0-test4:             2
2.4.0-test5pre4:         2

This is a pretty damning result for the 2.0 and 2.2 kernels. It's not
much of a file server that can only support 1 client at a time.
The 2.4 kernels definitely benefit from the new elevator, since all of
them can handle at least 2 clients.
The 2.4.0-test1-ac22 kernels were the high point. They handled all the
Mac clients I had available. However, there were still some indications
of write starvation, as some of the clients would drop to 300k per
second for periods of 30 seconds or longer and then pick up.
However, something has definitely gone wrong with the elevator in the
more recent 2.4 kernels. I suspect that, except for the
2.4.0-test1-ac22 kernels, there is a fairly serious bug in the elevator
sorting/merging code.

Although this benchmark is a general one and wasn't designed to
specifically highlight a Linux problem, it appears by coincidence to
have caught a chink in Linux's armour.
We don't see these kinds of problems with benchmarks like dbench.
My theory is that the network read is allowing a context switch between
writes, resulting in all write requests being perfectly interleaved.
As I understand it, the elevator even in the 2.2 kernels should have
coped OK with this, but it doesn't appear to be doing its job.
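
To spell out what "perfectly interleaved" means at the block layer:
with, say, three clients, the write requests arrive looking roughly
like this (offsets within three different 30 Meg files, so the blocks
are far apart on disk):

    client 1, file 1, offset  0k
    client 2, file 2, offset  0k
    client 3, file 3, offset  0k
    client 1, file 1, offset  8k
    client 2, file 2, offset  8k
    client 3, file 3, offset  8k
    ...

so every consecutive pair of requests is a long seek unless the
elevator sorts and merges them back into per-file runs.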

This should be a fairly easy benchmark to handle. The benchmark is just
writing and reading the same files and, with the number of clients used
here, the files fit entirely in the buffer cache. Therefore, there are
no reads being done: all file serving is coming straight out of the
buffer cache and we are only testing the write side of the elevator.

If I reduce the memory available to 64 Megs, so that the files no
longer fit in the buffer cache and the system has to start doing reads
as well, even the "good" 2.4.0-test1-ac22 kernels start falling apart
with 4 clients. I believe we should be able to handle far more clients
than this.

I think this is a symptom of a different problem.
My guess is that the elevator is being overwhelmed.
As the load rises, we start getting requests that would fail the latency
requirements if handled in elevator order. As we start handling these
out of order, we lose seek efficiency, which slows things down and
leads to still more requests that have to be handled out of order.
Once the latency requirements for disk ops start not being met, the
request queue fills up with operations that are past their latency
time and the algorithm falls back to first come, first served. Once the
requests stop being sorted, the system collapses into seeking madly.
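
To make the failure mode I have in mind concrete, here is a toy model
of a latency-bounded elevator. This is only my mental model of the
behaviour, not the actual kernel code:

    struct req {
        long sector;    /* where on disk the request lands   */
        long waited;    /* how long it has sat in the queue  */
    };

    /*
     * Toy dispatch rule: if the oldest request has blown its latency
     * budget, service it immediately regardless of where it is on the
     * disk; otherwise take the next request in sector (elevator) order.
     *
     * Once the disk is slow enough that most requests have blown their
     * budget, this degenerates into first come, first served: every
     * dispatch is a long seek, the disk gets even slower, and still
     * more requests blow their budget -- the collapse described above.
     */
    static struct req *pick_next(struct req *oldest,
                                 struct req *next_in_order,
                                 long latency_limit)
    {
        if (oldest->waited > latency_limit)
            return oldest;          /* latency rescue, out of order */
        return next_in_order;       /* normal elevator order        */
    }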

Ideally we want graceful degradation, not catastrophic failure. So once
the elevator sees enough requests that need to be handled out of order,
we need to fall back to looser latency guarantees, or even abandon them
entirely and handle requests in strict elevator order.

I was hoping to do some experimentation with elvtune to see how much
effect the various elevator settings have on the performance. However,
I have been unable to get elvtune to run with any of the 2.4 kernels.
The ioctl it uses returns an I/O error. If anyone has suggestions as to
why elvtune isn't working, please let me know.
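
In case it saves someone an strace: here is roughly how I understand
elvtune talks to the kernel. BLKELVGET, blkelv_ioctl_arg_t and the
header locations are from my recollection of the interface, so please
treat those names as assumptions and check them against your tree
before reading anything into this:

    #include <stdio.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>          /* BLKELVGET, if my memory is right  */
    #include <linux/elevator.h>    /* blkelv_ioctl_arg_t, ditto         */

    int main(int argc, char **argv)
    {
        blkelv_ioctl_arg_t arg;
        int fd = open(argc > 1 ? argv[1] : "/dev/hda", O_RDONLY);

        if (fd < 0 || ioctl(fd, BLKELVGET, &arg) < 0) {
            perror("BLKELVGET");   /* this is where I see the I/O error */
            return 1;
        }
        printf("read_latency %d write_latency %d\n",
               arg.read_latency, arg.write_latency);
        return 0;
    }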

--
Robert Cohen
Unix Support
TLTSU, Australian National University



