Re: [patch] Converting writeback linked lists to a tree based data structure

From: David Chinner
Date: Fri Jan 18 2008 - 21:51:21 EST

Next message: devzero: "Re: [BUG] 2.6.23-rc9 kernel panic - simple_map_write+0x4e/0x75"
Previous message: Andi Kleen: "Re: [patch 02/11] PAT x86: Map only usable memory in x86_64 identity map and kernel text"
In reply to: Fengguang Wu: "Re: [patch] Converting writeback linked lists to a tree based data structure"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Jan 18, 2008 at 01:41:33PM +0800, Fengguang Wu wrote:
> > That is, think of large file writes like process scheduler batch
> > jobs - bulk throughput is what matters, so the larger the time slice
> > you give them the higher the throughput.
> >
> > IMO, the sort of result we should be looking at is a
> > writeback design that results in cycling somewhat like:
> >
> > slice 1: iterate over small files
> > slice 2: flush large file 1
> > slice 3: iterate over small files
> > slice 4: flush large file 2
> > ......
> > slice n-1: flush large file N
> > slice n: iterate over small files
> > slice n+1: flush large file N+1
> >
> > So that we keep the disk busy with a relatively fair mix of
> > small and large I/Os while both are necessary.
>
> If we can sync fast enough, the lower layer would be able to merge
> those 4MB requests.

No, not necessarily - think of a stripe with a chunk size of 512k.
That 4MB will be split into 8x512k chunks and sent to different
devices (and hence elevator queues). The only way you get elevator
merging in this sort of config is that if you send multiple stripe
*width* sized amounts to the device in a very short time period.
I see quite a few filesystems with stripe widths in the tens of MB
range.....

> > Put simply:
> >
> > The higher the bandwidth of the device, the more frequently
> > we need to be servicing the inodes with large amounts of
> > dirty data to be written to maintain write throughput at a
> > significant percentage of the device capability.
> >
> > The writeback algorithm needs to take this into account for it
> > to be able to scale effectively for high throughput devices.
>
> Slow queues go full first. Currently the writeback code will skip
> _and_ congestion_wait() for congested filesystems. The better policy
> is to congestion_wait() _after_ all other writable pages have been
> synced.

Agreed.

The comments I've made are mainly concerned with getting efficient
flushing of a single device occuring. Interactions between multiple
devices are a separable issue....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: devzero: "Re: [BUG] 2.6.23-rc9 kernel panic - simple_map_write+0x4e/0x75"
Previous message: Andi Kleen: "Re: [patch 02/11] PAT x86: Map only usable memory in x86_64 identity map and kernel text"
In reply to: Fengguang Wu: "Re: [patch] Converting writeback linked lists to a tree based data structure"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]