Re: Accept() problem

Andi Kleen (ak@muc.de)
17 Nov 1999 22:18:58 +0100

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Nat Lanza: "Re: ENH: SysRq Screen-capture?"
Previous message: Jeff Garzik: "Re: resending patch: Documentation/pci.txt"

admin@ztnet.com (Zachary Williams) writes:

> BUG: In a load balancing enviornment going through hardware such as a
> server-iron, apache can be flooded with requests (this could be due to a
> traffic spike) and will eventually stop responding. The load balancer will
> then remove that server from 'active' status, because it fails to respond to
> http health checks. In most cases, only a few of the children ever recieve
> requests. Because of the limited children actually responding to requests,
> the load-balancer will never put the server back to 'active' status,
> therefor leaving the server down, until its children are killed (killing the
> parent, and restarting apache. a -HUP WILL NOT WORK!).

It is probably filling up the SYN_RECV socket queue per listen socket
(1024 sockets per default). You can check by looking for SYN_RECV sockets
in netstat -t. Also make sure you have syncookies off before 2.2.13 (because
they have a deadly bug). You can increase the SYN_RECV queue size with
the /proc/sys/net/ipv4/tcp_max_syn_backlog sysctl.

>
> This is a difficult to reproduce bug, however, given a
> load-balancing setup, it is very obvious, because of the certain conditions
> met. (requests MUST stop going to the affected server, otherwise children
> will respawn, and act normally.) Single server users will notice a
> 'slowdown' period, that lasts anywhere from 30 seconds to a few minutes,
> while the system kicks back into gear.

The bug is clearly in the load balancer, not in the kernel.
It should not hit any server with such a load spike, and it should
not kill that http server that quickly.

-Andi

-- This is like TV. I don't like TV.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/

Next message: Nat Lanza: "Re: ENH: SysRq Screen-capture?"
Previous message: Jeff Garzik: "Re: resending patch: Documentation/pci.txt"