Re: [PATCH 0/7] sg_ring: a ring of scatterlist arrays

From: James Bottomley
Date: Mon Jan 07 2008 - 10:48:52 EST


On Mon, 2008-01-07 at 15:38 +1100, Rusty Russell wrote:
> On Sunday 06 January 2008 02:31:12 James Bottomley wrote:
> > On Wed, 2007-12-19 at 17:31 +1100, Rusty Russell wrote:
> > > This patch series is the start of my attempt to simplify and make
> > > explicit the chained scatterlist logic.
> > >
> > > It's not complete: my SATA box boots and seems happy, but all the other
> > > users of SCSI need to be updated and checked. But I've gotten far enough
> > > to believe it's worth pursuing.
> >
> > Sorry for the delay in looking at this, I was busy with Holidays and
> > things.
>
> Thank you for your consideration.
>
> > When I compare sg_ring with the current sg_chain (and later sg_table)
> > implementations, I'm actually struck by how similar they are.
>
> I agree, they're solving the same problem. It is possible that the pain of
> change is no longer worthwhile, but I hate to see us give up on this. We're
> adding complexity without making it harder to misuse.

We're always open to new APIs (or more powerful and expanded old ones).
The way we've been doing the sg_chain conversion is to slide API layers
into the drivers so sg_chain becomes a simple API flip when we turn it
on. Unfortunately sg_ring doesn't quite fit nicely into this.

> > The other thing I note is that the problem you're claiming to solve with
> > sg_ring (the ability to add extra scatterlists to the front or the back
> > of an existing one) is already solved with sg_chain, so the only real
> > advantage of sg_ring was that it contains explicit counts, which
> > sg_table (in -mm) also introduces.
>
> I just converted virtio using latest Linus for fair comparison

Erm, but that's not really a fair comparison; you need the sg_table code
in

git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git

branch sg as well.

> , and it's still
> pretty ugly. sg_ring needs more work to de-scsi it. sg_table is almost
> sg_ring except it retains all the chaining warts.
>
> But we hit the same problems:
>
> 1) sg_chain loses information. The clever chain packaging makes reading easy,
> but manipulation is severely limited. You can append to your own chains by
> padding, but not someone else's. This works for SCSI, but what about the rest
> of us? And don't even think of joining mapped chains: it will almost work.

OK, but realistically some of your criticisms are way outside of the
design intent. Scatterlists (and now sg_chains) are the way the block
subsystem hands pages to its underlying block devices. We do this
through two APIs: blk_rq_map_sg(), which takes a request and places all
the pages into a scatterlist respecting the merging parameters, and
dma_map_sg(), which takes that scatterlist and produces an arch-specific
DMA-mapped scatterlist, reusing the old list. The requirement is that
dma_unmap_sg() return the chain to its former state (so the block device
may map and unmap between congestion waits to avoid running systems out
of scarce IOMMU mappings), but there's not even a parallel
blk_rq_unmap_sg() because the scatterlist is assumed disposable by the
block driver after the request completes.

By "someone elses" I'm assuming you mean a chain in a stacked block
driver? You are actually allowed to transform these ... depending on
who does the dma mapping, of course, if the lower driver (yours) does
DMA mapping, you can do anything you want. If the upper driver does,
then you have to preserve idempotence.

There has never, until now, been any requirement to join scatterlists
that have already been through dma_map_sg() ... that would wreak havoc
with the way we reuse the list, and damage the idempotence of map/unmap.
What is the use case you have for this?

> 2) sg_chain's end marker is only for reading non-dma elements, not for mapped
> chains, nor for writing into chains. Plus it's a duplicate of the num arg,
> which is still passed around for efficiency. (virtio had to implement
> count_sg() because I didn't want those two redundant num args).
>
> 3) By overloading struct scatterlist, it's invisible what code has been
> converted to chains, and which hasn't. Too clever by half!

No it's not ... that's the whole point. Code not yet converted to use
the API accessors is by definition unable to use chaining. Code
converted to use the accessors by design doesn't care (and so is
"converted to chains").

> Look at sg_chain(): it claims to join two scatterlists, but it doesn't. It
> assumes that prv is an array, not a chain. Because you've overloaded an
> existing type, this happens everywhere. Try passing skb_to_sgvec a chained
> skb.
>
> sg_ring would not have required any change to existing drivers, just those
> that want to use long sg lists. And then it's damn obvious which are which.

Which, by definition, would have been pretty much all of them.

Another driving factor is that the mempool allocation of scatterlists is
suboptimal in terms of their bucket sizes, so we were looking to tune
that once the chaining code was in. The guess is that we'd probably end
up with a uniform size for scatterlist chunks, but unless all drivers are
converted we can't do this type of tuning.
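
The kind of thing I have in mind looks roughly like this (a sketch of
bucketed mempool allocation; the bucket sizes and the names are
illustrative, not the actual scsi_lib.c tables):

        #include <linux/init.h>
        #include <linux/mempool.h>
        #include <linux/scatterlist.h>
        #include <linux/slab.h>

        /* one slab cache + mempool per scatterlist chunk size; an
         * allocation (or each chunk of a chained allocation) takes the
         * smallest bucket that fits */
        static struct sg_pool_desc {
                unsigned int size;
                const char *name;
                struct kmem_cache *cache;
                mempool_t *pool;
        } sg_pool_table[] = {
                { 8,   "sgpool-8"   },
                { 16,  "sgpool-16"  },
                { 32,  "sgpool-32"  },
                { 64,  "sgpool-64"  },
                { 128, "sgpool-128" },
        };

        static int __init sg_pools_init(void)
        {
                int i;

                for (i = 0; i < ARRAY_SIZE(sg_pool_table); i++) {
                        struct sg_pool_desc *p = &sg_pool_table[i];

                        p->cache = kmem_cache_create(p->name,
                                        p->size * sizeof(struct scatterlist),
                                        0, SLAB_HWCACHE_ALIGN, NULL);
                        if (!p->cache)
                                return -ENOMEM;
                        p->pool = mempool_create_slab_pool(2, p->cache);
                        if (!p->pool)
                                return -ENOMEM;
                }
                return 0;
        }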

> 4) sg_chain missed a chance to combine all the relevant information (max, num,
> num_dma and the sg array). sg_table comes close, but you still can't join
> them (no max information, and even if there were, what if it's full?).
> Unlike sg_ring, it's unlikely to reduce bugs.

sg_table is sg_chain ... they're incremental extensions of the same body
of work.
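
(For anyone not following the block git tree, sg_table is just a small
header over the chained list, roughly:

        struct sg_table {
                struct scatterlist *sgl;        /* the list */
                unsigned int nents;             /* number of mapped entries */
                unsigned int orig_nents;        /* original size of the list */
        };

with sg_alloc_table()/sg_free_table() allocating and chaining the chunks
behind it. That's paraphrased; check the sg branch for the authoritative
version.)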

> 5) (A little moot now) sg_ring didn't require arch changes.
>
> > The other differences are that sg_ring only allows adding at the front
> > or back of an existing sg_ring, it doesn't allow splicing at any point
> > like sg_chain does, so I'd say it's less functional (not that I actually
> > want anyone ever to do this, of course ...)
>
> Well it's just as possible, but you might have to copy more elements (with sg
> chaining you need only copy 1 sg element to make room for the chain elem).
> Agreed that it's a little out there...
>
> > The final point is that sg_ring requires a two level traversal: ring
> > list then scatterlist, whereas sg_chain only requires a single level
> > traversal. I grant that we can abstract out the traversal into
> > something that would make users think they're only doing a single level,
> > but I don't see what the extra level really buys us.
>
> We hide the real complexity from users, and that makes it less safe for
> non-trivial cases.
>
> Hence the introduction of YA datastructure: sg_table. This is getting close:
> just hang the array off it and you'll have sg_ring and no requirement for
> dynamic alloc all the time. And once you have a header, no need for chaining
> tricks...
>
> > The only thing missing from sg_chain perhaps is an accessor function
> > that does the splicing, which I can easily construct if you want to try
> > it out in virtio.
>
> I don't need that (I prepend and append), but it'd be nice if sg_next took a
> const struct scatterlist *.

What all this is saying to me is that we have an undesigned, ad hoc
interface for scatterlist manipulation in passthrough drivers. Perhaps
we should start by defining that API and its associated requirements,
i.e. what is it that passthrough drivers want to be able to do to
scatterlists? The only examples of such drivers that I know of are the
crypto API and now your virtio. Once we have the actual requirements
and then the API, I think the natural data structures will probably
drop out.
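
Just to make the question concrete, I mean something of this shape
(entirely hypothetical; none of these names exist, they're made up for
discussion):

        #include <linux/scatterlist.h>

        /* hypothetical handle a passthrough driver could manipulate */
        struct sg_span {
                struct scatterlist *sgl;        /* first element */
                unsigned int nents;             /* entries in use */
                unsigned int max;               /* room available */
        };

        /* add a buffer at the front or the back */
        int sg_span_prepend(struct sg_span *span, struct page *page,
                            unsigned int len, unsigned int offset);
        int sg_span_append(struct sg_span *span, struct page *page,
                            unsigned int len, unsigned int offset);

        /* splice one span into another, failing if there's no room */
        int sg_span_join(struct sg_span *dst, struct sg_span *src);

Whether that's backed by a flat array, a chain or a ring then becomes an
implementation detail.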

James

