Re: howto combat highly pathologic latencies on a server?

From: Hans-Peter Jansen
Date: Wed Mar 10 2010 - 19:15:35 EST


On Wednesday 10 March 2010, 19:15:48 Christoph Hellwig wrote:
> On Wed, Mar 10, 2010 at 06:17:42PM +0100, Hans-Peter Jansen wrote:
> > While this system usually operates fine, it suffers from delays, that
> > are displayed in latencytop as: "Writing page to disk: 8425,5 ms":
> > ftp://urpla.net/lat-8.4sec.png, but we see them also in the 1.7-4.8 sec
> > range: ftp://urpla.net/lat-1.7sec.png, ftp://urpla.net/lat-2.9sec.png,
> > ftp://urpla.net/lat-4.6sec.png and ftp://urpla.net/lat-4.8sec.png.
> >
> > From other observations, this issue "feels" like it is induced by single
> > synchronisation points in the block layer, e.g. if I create heavy IO load
> > on one RAID array, say resizing a VMware disk image, it can take up to
> > a minute to log in by ssh, although the ssh login does not touch this
> > area at all (different RAID arrays). Note, that the latencytop
> > snapshots above are made during normal operation, not this kind of
> > load..
>
> I had very similar issues on various systems (mostly using xfs, but some
> with ext3, too) using kernels before ~ 2.6.30 when using the cfq I/O
> scheduler. Switching to noop fixed that for me, or upgrading to a
> recent kernel where cfq behaves better again.

Christoph, thanks for this valuable suggestion: I've switched the I/O
scheduler to noop right away, and also set:

vm.dirty_ratio = 20
vm.dirty_background_ratio = 1

since the defaults of 40 and 10 do not seem to fit my needs either. Even
20 might still be oversized with 8GB of total memory.
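
To put rough numbers on that, here is a small sketch of what these
percentages mean in bytes. It assumes the ratios apply to the full 8 GiB,
which overstates things a bit, since the kernel works from the "dirtyable"
part of memory rather than total RAM. The helper name dirty_limit_bytes is
made up for this illustration.

```python
# Back-of-the-envelope upper bounds for dirty page cache.
# Assumption: ratios are applied to total RAM (8 GiB here); the kernel
# really uses dirtyable memory, so actual limits are somewhat lower.

GIB = 1024 ** 3
MIB = 1024 ** 2


def dirty_limit_bytes(total_ram_bytes, ratio_percent):
    """Return ratio_percent of total_ram_bytes, rounded down to whole bytes."""
    return total_ram_bytes * ratio_percent // 100


if __name__ == "__main__":
    total_ram = 8 * GIB
    settings = [
        ("vm.dirty_ratio (old default)", 40),
        ("vm.dirty_background_ratio (old default)", 10),
        ("vm.dirty_ratio (new)", 20),
        ("vm.dirty_background_ratio (new)", 1),
    ]
    for name, ratio in settings:
        limit = dirty_limit_bytes(total_ram, ratio)
        print(f"{name:42s} {ratio:3d}%  ~{limit / MIB:8.1f} MiB")
```

Under that assumption, 20% still allows about 1.6 GiB of dirty pages before
writers are throttled, while 1% starts background writeback at roughly
82 MiB.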

Thanks,
Pete