Re: Linux 2.6.29

From: Theodore Tso
Date: Mon Mar 30 2009 - 09:48:11 EST


On Sun, Mar 29, 2009 at 09:55:59PM -0600, Trenton D. Adams wrote:
> > (This is with a filesystem formatted as ext3, and mounted as either
> > ext3 or ext4; if the filesystem is formatted using "mke2fs -t ext4",
> > what you see is a very smooth 1.2-1.5 second fsync latency; indirect
> > blocks for very big files end up being quite inefficient.)
>
> Oh. I thought I had read somewhere that mounting ext4 over ext3 would
> solve the problem. Not sure where I read that now. Sorry for wasting
> your time.

Well, I believe it should solve it for most realistic workloads (and
I don't think "dd if=/dev/zero of=bigzero.img" qualifies as realistic).
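(If you want to reproduce the fsync latencies being discussed here, a
probe along the following lines, run while the dd is going, is all it
takes. This is just an illustrative sketch, not the exact tester used
in this thread; the file name and write size are arbitrary.)

/* fsync-probe.c -- illustrative fsync latency probe.
 *
 * Build: gcc -o fsync-probe fsync-probe.c -lrt
 * Run it on the filesystem under test while
 * "dd if=/dev/zero of=bigzero.img bs=1M" is running.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	struct timespec t0, t1;
	int fd = open("fsync-probe.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 'a', sizeof(buf));
	for (;;) {
		/* dirty one block, then time how long fsync() takes */
		if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf)) {
			perror("pwrite");
			return 1;
		}
		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (fsync(fd) < 0) {
			perror("fsync");
			return 1;
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("fsync: %.1f ms\n",
		       (t1.tv_sec - t0.tv_sec) * 1000.0 +
		       (t1.tv_nsec - t0.tv_nsec) / 1e6);
		sleep(1);
	}
}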

Looking more closely at the statistics, I can see the delays aren't
coming from trying to flush the data blocks in data=ordered mode. If
you disable delayed allocation (mount -o nodelalloc), this is what
you'll see in /proc/fs/jbd2/<dev>/history:

R/C  tid  wait   run  lock  flush   log  hndls  block  inlog  ctime write drop close
R     12    23  3836     0   1460  2563  50129     56     57
R     13     0  5023     0   1056  2100  64436     70     71
R     14     0  3156     0   1433  1803  40816     47     48
R     15     0  4250     0   1206  2473  57623     63     64
R     16     0  5000     0   1516  1136  61087     67     68

Note the amount of time (in milliseconds) in the flush column; that's
time spent flushing the allocated data blocks to disk. This goes away
once you enable delayed allocation:

R/C  tid  wait   run  lock  flush   log  hndls  block  inlog  ctime write drop close
R     56     0  2283     0     10  1250  32735     37     38
R     57     0  2463     0     13  1126  31297     38     39
R     58     0  2413     0     13  1243  35340     40     41
R     59     3  2383     0     20  1270  30760     38     39
R     60     0  2316     0     23  1176  33696     38     39
R     61     0  2266     0     23  1150  29888     37     38
R     62     0  2490     0     26  1140  35661     39     40
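(To pull these numbers out on your own machine, a trivial reader like
the one below is enough. It's an illustrative sketch: it assumes the
R-line field order shown above, and you'd point it at your own
journal's history file.)

/* jbd2-flush.c -- illustrative reader for /proc/fs/jbd2/<dev>/history.
 * Prints the per-transaction flush and log times (in ms) from the
 * R lines; assumes the field order shown in the tables above.
 *
 * Usage: ./jbd2-flush /proc/fs/jbd2/sda1/history   (path is an example)
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	char line[256];
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <history-file>\n", argv[0]);
		return 1;
	}
	f = fopen(argv[1], "r");
	if (!f) {
		perror(argv[1]);
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		unsigned int tid, wait, run, lock, flush, log;

		/* only the R (running) lines carry the flush/log columns */
		if (sscanf(line, "R %u %u %u %u %u %u",
			   &tid, &wait, &run, &lock, &flush, &log) == 6)
			printf("tid %3u: flush %5u ms  log %5u ms\n",
			       tid, flush, log);
	}
	fclose(f);
	return 0;
}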

You may see slightly worse times than these, since I'm running with a
patch (which will be pushed for 2.6.30) that makes sure the blocks we
write during the "log" phase are submitted using WRITE_SYNC instead
of WRITE. (Without this patch, the huge number of writes caused by
the VM trying to keep up with pages being dirtied at CPU speed via
"dd if=/dev/zero..." will interfere with the writes to the journal.)

During the log phase (which is averaging around 2 seconds with
nodelalloc, and 1 second with delayed allocation enabled), we write
the metadata to the journal. The number of blocks that we are
actually writing to the journal is small (around 40 per transaction,
which at 4k a block is only about 160k of sequential journal I/O, and
shouldn't take anywhere near a second), so I suspect we're seeing
some lock contention or some accounting overhead caused by the
metadata blocks constantly getting dirtied by the dd if=/dev/zero
task. We can look at whether this can be improved, possibly by
changing how we handle the locking, but it's no longer being caused
by the data=ordered flushing behaviour.

> Yes, I realize that. When trying to find performance problems I try
> to be as *unfair* as possible. :D

And that's a good thing from a development point of view when trying
to fix performance problems. When making statements about what people
are likely to find in real life, it's less useful.

- Ted