Re: [problem] raid performance loss with 2.6.26-rc8 on 32-bit x86(bisected)

From: Dan Williams
Date: Thu Jul 03 2008 - 03:02:34 EST


On Tue, 2008-07-01 at 22:18 -0700, Mel Gorman wrote:
> > Let us take commit 8b3e6cdc out of the equation and just look at raid0
> > performance:
> >
> > revision 2.6.25.8-fc8 54a6eb5c 54a6eb5c-nohighmem 2.6.26-rc8
> > 279 278 273 277
> > 281 278 275 277
> > 281 113 68.7 66.8
> > 279 69.2 277 73.7
> > 278 75.6 62.5 80.3
> > MB/s (avg) 280 163 191 155
> > % change 0% -42% -32% -45%
> > result base bad bad bad
> >
>
> Ok, based on your other mail, 54a6eb5c here is a bisection point. The good
> figures are on par with the "good" kernel with some disasterous runs leading
> to a bad average. The thing is, the bad results are way worse than could be
> accounted for by two-zonelist alone. In fact, the figures look suspiciously
> like only 1 disk is in use as they are roughly quartered. Can you think of
> anything that would explain that?

raid0 in contrast to raid5 is fairly simple it just remaps the bio.
Then it is up to the block layer to do some merging before it hits the
block device driver. Not much room for error, especially since single
drive performance seems to be unaffected. The data seems to show that
the vm is not queuing pages to be written back fast enough.

> Can you also confirm that using a bisection
> point before two-zonelist runs steadily and with high performance as expected
> please? This is to rule out some other RAID patch being a factor.

The immediately preceeding commit 18ea7e71 always runs with high
performance, and raid0 was virtually untouched during the merge window:

$ git shortlog v2.6.25..v2.6.26-rc8 drivers/md/raid0.c
Neil Brown (1):
Remove blkdev warning triggered by using md

> It would be worth running vmstat during the tests so we can see if IO
> rates are dropping from an overall system perspective. If possible,
> oprofile data from the same time would be helpful to see does it show up
> where we are getting stuck.

I watched /proc/vmstat during a dd run, and in the good case
nr_writeback never drops below 21,000. In the bad case it may briefly
break a few hundred pages.

This picture generated by seekwatcher shows that there are gaps in i/o
submission:

http://userweb.kernel.org/~djbw/md0-bad2.png

Compared to a good run:

http://userweb.kernel.org/~djbw/md0-good.png

What is interesting is that I was only able to get this data with the
following command:

./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb

When I tried to watch all four disks, i.e. 5 blktrace tasks running
instead of 2:

./seekwatcher -W -t md0.trace -o md0-bad2.png -p 'dd if=/dev/zero of=/dev/md0 bs=1024k count=2048' -d /dev/md0 -d /dev/sdb -d /dev/sdc -d /dev/sdd -d /dev/sde

...it always measured good performance, as if having more tasks running
stimulates the vm to keep writeback going at a high rate. Could
writeback be getting stalled on one cpu or more cpu's?

Oprofile only seems to show that we are idling in the failure case:
---good---
samples % image name symbol name
18229 20.7171 vmlinux-18ea7e71 __copy_from_user_ll_nocache_nozero
4518 5.1347 vmlinux-18ea7e71 clear_user
2137 2.4287 vmlinux-18ea7e71 get_page_from_freelist
1929 2.1923 vmlinux-18ea7e71 constant_test_bit
1745 1.9832 vmlinux-18ea7e71 clear_bit
1320 1.5002 vmlinux-18ea7e71 kmem_cache_alloc
1290 1.4661 vmlinux-18ea7e71 blk_rq_map_sg
1287 1.4627 vmlinux-18ea7e71 bio_put
1022 1.1615 raid0.ko raid0_make_request
938 1.0660 vmlinux-18ea7e71 __make_request
938 1.0660 vmlinux-18ea7e71 test_clear_page_writeback
899 1.0217 vmlinux-18ea7e71 __percpu_counter_add
---bad---
samples % image name symbol name
17844 16.3714 vmlinux-54a6eb5c __copy_from_user_ll_nocache_nozero
4565 4.1883 vmlinux-54a6eb5c clear_user
2229 2.0450 vmlinux-54a6eb5c get_page_from_freelist
2202 2.0203 vmlinux-54a6eb5c mwait_idle
2129 1.9533 vmlinux-54a6eb5c schedule
1882 1.7267 vmlinux-54a6eb5c constant_test_bit
1730 1.5872 vmlinux-54a6eb5c clear_bit
1157 1.0615 vmlinux-54a6eb5c kmem_cache_alloc


> > These numbers are taken from the results of:
> > for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=2048; done
> >
> > Where md0 is created by:
> > mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 0
> >
> > I will try your debug patch next Mel, and then try to collect more data
> > with blktrace.
> >
>
> I tried reproducing this but I don't have the necessary hardware to even
> come close to reproducing your test case :( . I got some dbench results
> with oprofile but found no significant differences between 2.6.25 and
> 2.6.26-rc8. However, I did find the results of dbench varied less between
> runs with the "repork" patch that made next_zones_zonelist() an inline
> function. Have you tried that patch yet?

I gave the patch a shot but observed no change. As you predicted we are
looking at something more pronounced than some lost cpu cycles.

--
Dan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/