Re: [PATCH] ext4: reduce scheduling latency with delayed allocation

From: Michal Schmidt
Date: Wed Mar 10 2010 - 08:09:50 EST


On Mon, 1 Mar 2010 22:06:19 -0500 tytso@xxxxxxx wrote:
> On Mon, Mar 01, 2010 at 01:34:35PM +0100, Michal Schmidt wrote:
> > mpage_da_submit_io() may process tens of thousands of pages at a
> > time. Unless full preemption is enabled, it causes scheduling
> > latencies in the order of tens of milliseconds.
> >
> > It can be reproduced simply by writing a big file on ext4
> > repeatedly with dd if=/dev/zero of=/tmp/dummy bs=10M count=50
> >
> > The patch fixes it by allowing the loop to reschedule.
>
> Have you tested for any performance regressions as a result of this
> patch, using some file system benchmarks?

I used the 'fio' benchmark to test sequential write speed. Here are the
results:

Test            kernel             aggregate bandwidth
------------------------------------------------------
hdd-multi       2.6.33.nopreempt    32.7 ±  3.5 MB/s
hdd-multi       2.6.33.reduce       33.8 ±  3.7 MB/s
hdd-multi       2.6.33.preempt      33.4 ±  3.1 MB/s

hdd-single      2.6.33.nopreempt    35.9 ±  2.1 MB/s
hdd-single      2.6.33.reduce       36.6 ±  2.3 MB/s
hdd-single      2.6.33.preempt      35.9 ±  2.0 MB/s

ramdisk-multi   2.6.33.nopreempt   189.7 ±  9.2 MB/s
ramdisk-multi   2.6.33.reduce      191.4 ±  9.5 MB/s
ramdisk-multi   2.6.33.preempt     163.5 ±  9.4 MB/s

ramdisk-single  2.6.33.nopreempt   152.3 ± 10.9 MB/s
ramdisk-single  2.6.33.reduce      171.3 ± 17.0 MB/s
ramdisk-single  2.6.33.preempt     144.2 ± 15.2 MB/s

The tests were run on a laptop with dual AMD Turion 2 GHz, 2 GB RAM.
A newly created filesystem was used for every fio run.
In the 'hdd' tests the filesystem was on a 24 GB LV on a harddisk. These
tests were repeated 12 times.
- In the '-single' variant a single process wrote a 5 GB file.
- In the '-multi' variant 5 processes wrote a 1 GB file each.
In the 'ramdisk' tests the filesystem was on a 1.5 GB ramdisk. These
tests were repeated >40 times.
- In the '-single' variant a single process wrote a 1400 MB file.
- In the '-multi' variant 5 processes wrote a 280 MB file each.
The kernels were:
'2.6.33.nopreempt' - vanilla 2.6.33 with CONFIG_PREEMPT_NONE
'2.6.33.reduce' - the same + the patch to add the cond_resched()
'2.6.33.preempt' - 2.6.33 with CONFIG_PREEMPT (for curiosity)
The data for 'aggregate bandwidth' were taken from fio's 'aggrb' result.
The margin of error as reported in the table is 2 * standard deviation.
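
For illustration, a minimal sketch of how a "mean ± margin" figure of this
kind can be derived from per-run 'aggrb' values. The per-run numbers below
are made up for the example (the raw runs are not part of this mail), and
the use of the sample standard deviation is an assumption; the mail only
says "2 * standard deviation".

```python
import statistics


def summarize(samples):
    """Return (mean, margin), with margin = 2 * standard deviation."""
    mean = statistics.mean(samples)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    margin = 2 * statistics.stdev(samples)
    return mean, margin


# Hypothetical per-run aggregate bandwidths in MB/s, NOT measured data.
runs = [35.1, 36.4, 34.9, 37.0, 36.2, 35.8]

mean, margin = summarize(runs)
print(f"{mean:.1f} ± {margin:.1f} MB/s")
```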

Conclusion: Adding the cond_resched() did not result in any measurable
performance decrease of sequential writes. (The results show a
performance increase, but it's within the margin of error.)
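
One way to read "within the margin of error" is that the mean ± margin
intervals of the two kernels overlap. The sketch below checks that reading
for '2.6.33.nopreempt' against '2.6.33.reduce', using only the numbers from
the table above. The overlap criterion is an assumption about how to
interpret the claim, not a formal significance test.

```python
# (mean, margin) in MB/s, copied from the results table in this mail.
# The margin is 2 * standard deviation, as stated in the test description.
RESULTS = {
    "hdd-multi":      {"nopreempt": (32.7, 3.5),   "reduce": (33.8, 3.7)},
    "hdd-single":     {"nopreempt": (35.9, 2.1),   "reduce": (36.6, 2.3)},
    "ramdisk-multi":  {"nopreempt": (189.7, 9.2),  "reduce": (191.4, 9.5)},
    "ramdisk-single": {"nopreempt": (152.3, 10.9), "reduce": (171.3, 17.0)},
}


def interval(mean, margin):
    """Return the (low, high) bounds of mean ± margin."""
    return mean - margin, mean + margin


def intervals_overlap(a, b):
    """True if the mean ± margin intervals of a and b share any point."""
    lo_a, hi_a = interval(*a)
    lo_b, hi_b = interval(*b)
    return lo_a <= hi_b and lo_b <= hi_a


overlap = {
    test: intervals_overlap(kernels["nopreempt"], kernels["reduce"])
    for test, kernels in RESULTS.items()
}

for test, ok in overlap.items():
    print(f"{test:15s} intervals overlap: {ok}")
```

With the table's values, the intervals overlap in all four tests, including
'ramdisk-single', where the difference between the means is largest.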

> I don't think this is the best way to fix this problem, though. The
> real right answer is to change how the code is structured. All of the
> callsites that call mpage_da_submit_io() are immediately preceded by
> mpage_da_map_blocks(). These two functions should be combined and
> instead of calling ext4_writepage() for each page,
> mpage_da_map_and_write_blocks() should make a single call to
> submit_bio() for each extent. That should be far more CPU efficient,
> solving both your scheduling latency issue as well as helping out for
> benchmarks that strive to stress both the disk and CPU simultaneously
> (such as for example the TPC benchmarks).
>
> This will also make our blktrace results much more compact, and Chris
> Mason will be very happy about that!

You're almost certainly right, but I'm not likely to make such a change
in the near future.

Michal