Re: [kernel.org users] cannot ssh to master.kernel.org

From: J.H.
Date: Thu Sep 30 2010 - 19:21:18 EST


Hey everyone,

So this morning we found 'master' basically spinning off the hook, with
a consistently high load [READ: around 160ish]. Needless to say, mail
was getting deferred due to the load, and I'm sure ssh was having similar
issues.

Several things were noticed on the system:

- Several cron jobs had spun up multiple copies of themselves, despite
locking being present in the jobs. I'm not sure if they were attempting
to check the existing locks or if they were actually running through but
they were present.

- The loads were relatively high, but not seemingly because of disk i/o.
There were a great number of processes in the run state.

- Attempting to explicitly kill many of the processes left them in a
zombie state, but they were still consuming CPU resources. Attempting
to kill them again did not result in the death of the processes, or the
relinquishing of the cpu resources. Attempting to strace the process
yielded nothing of interest / use. lsof on the zombie process did
return its currently open file handles, including tcp connections.

- Disks all seemed to be readable and writeable.

- a sysrq+l dump of the system in this messed up state can be found at
http://pastebin.osuosl.org/35126 (this was originally requested by
Johannes Weiner)

- Perf was available in the kernel and userspace, however attempting to
run 'perf top' resulted in a stalled process sitting, seemingly, forever
in D+ state. (originally requested by Thomas Gleixner)
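
For reference on the cron locking item above: the actual locking scheme
used by the jobs on 'master' is not described here, so the following is
only a minimal sketch of one common approach, a non-blocking flock() on
a lock file. The names acquire_lock and run_once and the lock path are
hypothetical, chosen for illustration.

```python
import fcntl


def acquire_lock(path):
    """Return an open file holding an exclusive flock, or None if it is held."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of queueing up
        # behind an earlier copy of the job.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_once(lock_path, job):
    """Run job() only if no other holder has the lock; report whether it ran."""
    lock = acquire_lock(lock_path)
    if lock is None:
        return False  # another copy is still running, so skip this round
    try:
        job()
        return True
    finally:
        lock.close()  # closing the file releases the flock


# Hypothetical usage from a cron entry point:
# run_once("/var/lock/example-job.lock", do_the_work)
```

With this pattern a second copy exits right away rather than piling up,
though it would not help if the first copy itself is stuck and never
releases the lock.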

Considering that at one point running zombies that were eating cpu were
outnumbering the still running processes, the inability to get the loads
below 120 and the general mess of the machine, we finally bounced the
machine and let everything come back up.
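
For anyone wanting to get the same kind of picture on another box (how
many processes are runnable, in uninterruptible sleep, or zombies), here
is a rough sketch that tallies the state letter from /proc/<pid>/stat.
It is not what was run on 'master'; the function names are made up for
illustration, and it assumes a Linux /proc layout.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter state (R, S, D, Z, ...) from a stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so locate the LAST ')' and take the field right after it.
    return stat_line[stat_line.rindex(")") + 2]


def tally_states(proc_root="/proc"):
    """Count processes per state letter under proc_root."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a pid directory
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as stat_file:
                counts[parse_state(stat_file.read())] += 1
        except OSError:
            continue  # the process went away while we were looking
    return counts


# Hypothetical usage:
# for state, count in sorted(tally_states().items()):
#     print(state, count)
```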

One additional note, not necessarily related to the mess today, but
stuff we've been noticing.

- kswapd0 has been using a lot of cpu time. I wouldn't be concerned
about this if it was say, 10% of a cpu, or maybe even 20% of a cpu for a
short time. It has however been running in some cases at 100% cpu on a
single core for hours on end. This seems to happen under slightly
higher loads, particularly when a number of rsyncs are going on
simultaneously.

Replicating the rsyncs on a nearly identical box, running an older
2.6.30 kernel, did not produce this much kswapd0 cpu usage. Johannes
Weiner was helping me look into it yesterday, but I don't think we
found anything conclusive.
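
To put a number on the kswapd0 cpu usage when comparing two boxes, one
option is to sample the utime and stime fields of /proc/<pid>/stat twice
and convert the difference to a percentage of one core. This is only a
sketch under the assumption of a Linux /proc; the function names are
hypothetical, and the pid of kswapd0 has to be looked up separately.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (fields 14 and 15) in clock ticks."""
    # Split after the last ')' so a command name with spaces cannot
    # shift the field positions; fields[0] is then field 3 (state).
    fields = stat_line[stat_line.rindex(")") + 2:].split()
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before, ticks_after, interval_s, clk_tck):
    """Convert a tick delta over interval_s seconds to percent of one core."""
    return 100.0 * (ticks_after - ticks_before) / clk_tck / interval_s


def sample(pid, interval_s=10.0):
    """Measure the cpu usage of one pid over interval_s seconds."""
    path = "/proc/%d/stat" % pid
    with open(path, errors="replace") as stat_file:
        before = cpu_ticks(stat_file.read())
    time.sleep(interval_s)
    with open(path, errors="replace") as stat_file:
        after = cpu_ticks(stat_file.read())
    return cpu_percent(before, after, interval_s, os.sysconf("SC_CLK_TCK"))


# Hypothetical usage, with the pid of kswapd0 filled in by hand:
# print(sample(kswapd0_pid))
```

A reading close to 100 over a long interval would match the "100% cpu
on a single core" behaviour described above.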

Anyway, thought I'd let everyone know what happened with the unexpected
outage this morning. Things seem to have settled somewhat and the
machine is up.

- John 'Warthog9' Hawley

On 09/30/2010 09:34 AM, Kevin Hilman wrote:
> As of this morning, I can no longer ssh to master.kernel.org to push git
> trees.
>
> Anyone else having ssh problems?
>
> Kevin
>
> _______________________________________________
> Users mailing list
> Users@xxxxxxxxxxxxxxxx
> http://linux.kernel.org/mailman/listinfo/users