Re: BUG: Slowdown on 3000 socket-machines tracked down

From: Nick Piggin
Date: Mon Mar 07 2005 - 00:41:49 EST


Willy Tarreau wrote:
> On Mon, Mar 07, 2005 at 04:14:37PM +1100, Nick Piggin wrote:
>
> > I think you would have better luck in reproducing this problem if you
> > did the full sendfile thing.
> >
> > I think it is becoming disk bound due to page reclaim problems, which
> > is causing the slowdown.
> >
> > In that case, writing the network only test would help to confirm the
> > problem is not a networking one - so not useless by any means.


> Not necessarily, Nick. I have written an HTTP testing tool which matches
> the description of Ben's: non-blocking, single-threaded, no disk I/O,
> etc. It works flawlessly under 2.4, and gives me random numbers in 2.6,

No, you're right, I'm not 100% sure, so I'm definitely not saying
Ben's test will be useless. Just that if it is not too hard to
make one with sendfile, I think he should.

If he makes a network-only version and cannot reproduce the problems,
that *doesn't* mean it is *not* a network problem. However, if he
reproduces the problem with a full sendfile version and not the
network-only one, then that is a better indicator... but I'm rambling.
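
To make the two variants concrete, here is a rough sketch of what I mean,
in Python for brevity. It is not Ben's tool, and every name in it
(send_network_only, send_with_sendfile, PAYLOAD_SIZE, ...) is made up for
illustration. The first sender pushes bytes from memory only; the second
goes through the file, so the page cache (and, on a loaded box, page
reclaim) is in the path. A real test would use thousands of non-blocking
TCP sockets rather than one local socket pair.

```python
import os
import socket
import tempfile

# Hypothetical payload size, kept small so a single send fits in the
# socket buffer and the sketch cannot block on itself.
PAYLOAD_SIZE = 8192


def send_network_only(sock, size=PAYLOAD_SIZE):
    """Send `size` bytes straight from memory: no file, no disk I/O."""
    sock.sendall(b"x" * size)
    return size


def send_with_sendfile(sock, path):
    """Send a file's contents with socket.sendfile().

    socket.sendfile() uses os.sendfile() where the platform provides it
    and falls back to plain send() otherwise.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


def read_exactly(sock, size):
    """Read exactly `size` bytes from `sock` (or fewer if the peer closes)."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def run_both_variants():
    """Run each variant once and return the bytes received per variant."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"y" * PAYLOAD_SIZE)
        path = tmp.name
    sender, receiver = socket.socketpair()
    try:
        sent_mem = send_network_only(sender)
        got_mem = len(read_exactly(receiver, sent_mem))
        sent_file = send_with_sendfile(sender, path)
        got_file = len(read_exactly(receiver, sent_file))
    finally:
        sender.close()
        receiver.close()
        os.unlink(path)
    return {"network_only": got_mem, "sendfile": got_file}
```

If only the sendfile variant shows the stalls under load, that points away
from the network stack; if both do, it says nothing either way about
page reclaim.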

> especially if I start some CPU activity on the system, I can get pauses
> of up to 13 seconds without this tool doing anything!!! At first I
> believed it was because of the scheduler, but it might also be related
> to what is described here since I had somewhat the same setup (gigE, 1500,
> thousands of sockets). I never had enough time to investigate more, so I
> went back to 2.4.


I have heard other complaints about this, and they are definitely
related to the scheduler (not saying yours is, but it is very possible).
