Re: [00/17] Large Blocksize Support V3

From: Jens Axboe
Date: Thu Apr 26 2007 - 14:17:09 EST


On Thu, Apr 26 2007, Christoph Hellwig wrote:
> On Thu, Apr 26, 2007 at 08:03:58PM +0200, Jens Axboe wrote:
> > On Thu, Apr 26 2007, Christoph Lameter wrote:
> > > > Iff we really the larger physical page size to support the hardware
> > > > then it makes sense to go down a path of larger pages. But it doesn't.
> > >
> > > You are redefining the problem. We need larger physical sizes to support
> > > the hardware. Yes. We can dodge the issue with shim layers and hacks. It
> > > is obvious from the kernel sources that this is needed.
> >
> > We definitely don't. Larger sizes are ONE way to solve the problem, they
> > are definitely not the only one. If the larger pages become unfeasible
> > for some reason (be it fragmentation, or just because the design isn't
> > good), then we can solve it differently.
>
> Exactly. But the only counter-proposal we have so far seems far worse :)

Lets look at some numbers. I'll just concentrate on the scatterlist,
since the bio_vec is smaller. On x86 32-bit, the scatterlist is 20 bytes
long. If we accept that 2^1 allocations are ok (they should be), then we
can support ~1.6mb ios just like that.

My approach would be to support scatterlist chaining. Essentially you'd
have the last element of the sglist pointing to the next array of
entries. We can then stick to 128 entry arrays which fit nicely in a
single page allocation and easily support >> 2mb ios. The only caveat is
that you'd need to update the drivers to get there, since a regular
iteration over the array isn't enough. My plan was to add an sglist
iterator helper that hides this from the drivers, if they need to loop
over the scatterlist. Things like {dma/pci}_map_sg() would of course be
updated.

The above can be implemented fairly cleanly, and on a need-to-have
basis. It's not something that'll break drivers.

What do you think?

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/