Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
From: Dave Chinner
Date: Fri May 23 2014 - 02:42:56 EST
On Tue, Apr 29, 2014 at 01:57:14AM +0200, Andres Freund wrote:
> Hi Dave,
>
> On 2014-04-29 09:47:56 +1000, Dave Chinner wrote:
> > ping?
>
> I'd replied at http://marc.info/?l=linux-mm&m=139730910307321&w=2
I missed it, sorry.
I've had a bit more time to look at this behaviour now and tweaked
the test program as you suggested, but I simply can't get XFS to
misbehave in the manner you demonstrated. However, I can reproduce
major read latency changes and writeback flush storms with ext4 - I
had originally only tested on XFS. I'm using the no-op IO scheduler
everywhere, too.
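To be concrete about the shape of the workload, it's roughly the
following. This is only a sketch of the general idea, not Andres'
actual test program; the file names, sizes, intervals and thread
counts here are made up:

/*
 * Sketch only: random buffered reads on one file while another
 * thread dirties a second file and fsync()s it periodically,
 * reporting per-interval read latency.  "readfile" is assumed to be
 * pre-created at FILE_SIZE; "writefile" is created sparse.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE	(1024L * 1024 * 1024)	/* 1GB - made up */
#define BLOCK_SIZE	8192			/* postgres-like 8k blocks */

static double now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

static off_t random_offset(void)
{
	return (random() % (FILE_SIZE / BLOCK_SIZE)) * BLOCK_SIZE;
}

/* dirty a chunk of the write file, then flush it - the "checkpoint" */
static void *writer(void *arg)
{
	int fd = *(int *)arg;
	char buf[BLOCK_SIZE];

	memset(buf, 'w', sizeof(buf));
	for (;;) {
		for (int i = 0; i < 32768; i++)		/* ~256MB dirtied */
			pwrite(fd, buf, BLOCK_SIZE, random_offset());
		fsync(fd);
		sleep(10);
	}
	return NULL;
}

int main(void)
{
	int wfd = open("writefile", O_RDWR | O_CREAT, 0644);
	int rfd = open("readfile", O_RDONLY);
	char buf[BLOCK_SIZE];
	pthread_t tid;

	if (wfd < 0 || rfd < 0) {
		perror("open");
		return 1;
	}
	pthread_create(&tid, NULL, writer, &wfd);

	/* random buffered reads, reporting avg/max latency every 5s */
	for (;;) {
		double start = now_ms(), sum = 0, max = 0;
		long count = 0;

		while (now_ms() - start < 5000) {
			double t0 = now_ms();

			pread(rfd, buf, BLOCK_SIZE, random_offset());

			double d = now_ms() - t0;
			sum += d;
			if (d > max)
				max = d;
			count++;
		}
		printf("read: avg: %.1f msec; max: %.1f msec\n",
		       count ? sum / count : 0, max);
	}
	return 0;
}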
I ran the tweaked version I have for a couple of hours on XFS, and
only saw a handful of abnormal writeback events where the write IOPS
spiked above the normal periodic peaks by enough to cause any
noticeable increase in read latency. Even then the maximums were
in the 40ms range, nothing much higher.
ext4, OTOH, generated a much, much higher periodic write IO load
and regularly caused read IO latencies in the hundreds of
milliseconds. Every so often this occurred on ext4 (5s sample rate):
Device:  rrqm/s  wrqm/s      r/s       w/s  rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
vdc        0.00    3.00  3142.20    219.20  34.11   19.10    32.42     1.11   0.33    0.33    0.31   0.27  91.92
vdc        0.00    0.80  3311.60    216.20  35.86   18.90    31.79     1.17   0.33    0.33    0.39   0.26  92.56
vdc        0.00    0.80  2919.80   2750.60  31.67   48.36    28.90    20.05   3.50    0.36    6.83   0.16  92.96
vdc        0.00    0.80   435.00  15689.80   4.96  198.10    25.79   113.21   7.03    2.32    7.16   0.06  99.20
vdc        0.00    0.80  2683.80    216.20  29.72   18.98    34.39     1.13   0.39    0.39    0.40   0.32  91.92
vdc        0.00    0.80  2853.00    218.20  31.29   19.06    33.57     1.14   0.37    0.37    0.36   0.30  92.56
Which is, I think, a sign of what you'd been trying to demonstrate -
a major dip in read performance when writeback is flushing.
In comparison, this is from XFS:
Device:  rrqm/s  wrqm/s      r/s       w/s  rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
vdc        0.00    0.00  2416.40    335.00  21.02    7.85    21.49     0.78   0.28    0.30    0.19   0.24  65.28
vdc        0.00    0.00  2575.80    336.00  22.68    7.88    21.49     0.81   0.28    0.29    0.16   0.23  66.32
vdc        0.00    0.00  1740.20   4645.20  15.60   58.22    23.68    21.21   3.32    0.41    4.41   0.11  68.56
vdc        0.00    0.00  2082.80    329.00  18.28    7.71    22.07     0.81   0.34    0.35    0.26   0.28  67.44
vdc        0.00    0.00  2347.80    333.20  19.53    7.80    20.88     0.83   0.31    0.32    0.25   0.25  67.52
You can see how much less load XFS is putting on the storage - it's
only 65-70% utilised compared to the 90-100% load that ext4 is
generating.
What is interesting here is the difference in IO patterns. ext4 is
doing much larger IOs than XFS - its average IO size is 16k, while
XFS's is a bit over 8k. So while the read and background write IOPS
rates are similar, ext4 is moving a lot more data to/from disk in
larger chunks.
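If anyone wants to cross-check those IO size numbers independently
of iostat, the same information can be derived by diffing
/proc/diskstats. A quick sketch only - the device name and sample
interval are hard coded as examples and error handling is minimal:

/*
 * Sketch: derive the average request size iostat reports by diffing
 * /proc/diskstats.  The sector counts are in 512 byte units.  Device
 * name defaults to "vdc" purely as an example.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct snap {
	unsigned long long rd_ios, rd_sectors, wr_ios, wr_sectors;
};

static int sample(const char *dev, struct snap *s)
{
	char line[256], name[64];
	unsigned int maj, mnr;
	unsigned long long f[7];
	FILE *fp = fopen("/proc/diskstats", "r");

	if (!fp)
		return -1;
	while (fgets(line, sizeof(line), fp)) {
		if (sscanf(line, "%u %u %63s %llu %llu %llu %llu %llu %llu %llu",
			   &maj, &mnr, name, &f[0], &f[1], &f[2], &f[3],
			   &f[4], &f[5], &f[6]) < 10)
			continue;
		if (!strcmp(name, dev)) {
			s->rd_ios = f[0];	/* reads completed */
			s->rd_sectors = f[2];	/* sectors read */
			s->wr_ios = f[4];	/* writes completed */
			s->wr_sectors = f[6];	/* sectors written */
			fclose(fp);
			return 0;
		}
	}
	fclose(fp);
	return -1;
}

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "vdc";
	struct snap a, b;

	for (;;) {
		if (sample(dev, &a))
			return 1;
		sleep(5);		/* match the 5s iostat interval */
		if (sample(dev, &b))
			return 1;

		unsigned long long ios = (b.rd_ios - a.rd_ios) +
					 (b.wr_ios - a.wr_ios);
		unsigned long long sectors = (b.rd_sectors - a.rd_sectors) +
					     (b.wr_sectors - a.wr_sectors);

		printf("%s: %llu IOs, avg IO size %.1f KB\n", dev, ios,
		       ios ? sectors * 512.0 / ios / 1024 : 0.0);
	}
	return 0;
}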
This also seems to translate to much larger writeback IO peaks on
ext4. I have no idea what this means in terms of actual application
throughput, but it looks very much to me like the nasty read
latencies are much more pronounced on ext4 because of the higher
read bandwidth and write IOPS it is driving.
The screenshot of the recorded behaviour is attached - the left
hand side is the tail end (~30min) of the 2 hour long XFS run,
followed by the first half hour of the ext4 run. The difference in
IO behaviour is quite obvious....
What is interesting is that CPU usage is not very different between
the two filesystems, but iowait is much, much higher for ext4. That
indicates that ext4 is definitely loading the storage more, and so
is much more likely to see IO load related latencies...
So, seeing the differences in behaviour just by changing
filesystems, I also ran the workload on btrfs. Ouch - it was even
worse than ext4 in terms of read latencies - they were highly
unpredictable, and massively variable even within a read group:
....
read[11331]: avg: 0.3 msec; max: 7.0 msec
read[11340]: avg: 0.3 msec; max: 7.1 msec
read[11334]: avg: 0.3 msec; max: 7.0 msec
read[11329]: avg: 0.3 msec; max: 7.0 msec
read[11328]: avg: 0.3 msec; max: 7.0 msec
read[11332]: avg: 0.6 msec; max: 4481.2 msec
read[11342]: avg: 0.6 msec; max: 4480.6 msec
read[11332]: avg: 0.0 msec; max: 0.7 msec
read[11342]: avg: 0.0 msec; max: 1.6 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
.....
It was also not uncommon to see major commit latencies:
read[11335]: avg: 0.2 msec; max: 8.3 msec
read[11341]: avg: 0.2 msec; max: 8.5 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
commit[11326]: avg: 0.7 msec; max: 5302.3 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
commit[11326]: avg: 0.0 msec; max: 6.1 msec
read[11337]: avg: 0.2 msec; max: 6.2 msec
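The commit[11326] number above is presumably dominated by the
fsync()/fdatasync() of the WAL file stalling behind writeback. The
same stall can be seen in isolation just by timing that call while
the filesystem is busy flushing - a trivial sketch, with the file
name, record size and commit rate made up:

/*
 * Sketch: append a "WAL record" and time the flush a commit would
 * do.  File name, record size and commit rate are made up.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main(void)
{
	char rec[4096];
	int fd = open("wal-test", O_WRONLY | O_CREAT | O_APPEND, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(rec, 'x', sizeof(rec));
	for (;;) {
		write(fd, rec, sizeof(rec));
		double t0 = now_ms();

		fdatasync(fd);			/* the "commit" */

		double d = now_ms() - t0;
		if (d > 100.0)			/* flag the big stalls */
			printf("commit stall: %.1f msec\n", d);
		usleep(10000);			/* ~100 commits/sec */
	}
	return 0;
}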
So, what it looks like to me right now is that filesystem choice
alone has a major impact on writeback behaviour and read latencies.
Each of these filesystems implements writeback aggregation
differently (ext4 completely replaces the generic writeback path,
XFS optimises below ->writepage, and btrfs has super magic COW
powers).
That means it isn't clear that there's any generic infrastructure
problem here, and it certainly isn't clear that each filesystem has
the same problem or that the issues can be solved by a generic
mechanism.
I think you probably need to engage the ext4 developers directly to
understand what ext4 is doing in detail, or work out how to prod XFS
into displaying that extremely bad read latency behaviour....
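A cheap way to watch how differently each filesystem accumulates
and flushes dirty data while the workload runs is to poll the Dirty
and Writeback counters out of /proc/meminfo alongside iostat - just
a sketch of an observation aid, nothing more:

/*
 * Sketch: poll Dirty and Writeback from /proc/meminfo once a second
 * to see how much dirty data builds up before each filesystem
 * flushes it, and how big the flush bursts are.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		FILE *fp = fopen("/proc/meminfo", "r");
		char line[128];
		unsigned long dirty = 0, writeback = 0;

		if (!fp)
			return 1;
		while (fgets(line, sizeof(line), fp)) {
			sscanf(line, "Dirty: %lu kB", &dirty);
			sscanf(line, "Writeback: %lu kB", &writeback);
		}
		fclose(fp);
		printf("Dirty: %8lu kB   Writeback: %8lu kB\n",
		       dirty, writeback);
		sleep(1);
	}
	return 0;
}

Large drops in Dirty with corresponding spikes in Writeback should
line up with the write IOPS peaks in the iostat samples above.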
> As an additional note:
>
> > On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> > > I'm not sure how you were generating the behaviour you reported, but
> > > the test program as it stands does not appear to be causing any
> > > problems at all on the sort of storage I'd expect large databases to
> > > be hosted on....
>
> A really, really large number of databases aren't stored on big
> enterprise rigs...
I'm not using a big enterprise rig. I've reproduced these results on
a low-end Dell server with the internal H710 SAS RAID and a pair of
consumer SSDs in RAID0, as well as via a 4-year-old Perc/6e SAS RAID
HBA with 12 x 2TB nearline SAS drives in RAID0.
Cheers,
Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx