Re: [PATCH] Percpu tag allocator

From: Kent Overstreet
Date: Thu Jun 13 2013 - 17:53:49 EST


On Thu, Jun 13, 2013 at 11:53:18AM -0700, Tejun Heo wrote:
> Hello, Andrew, Kent.
>
> (cc'ing NFS folks for id[r|a] discussion)
>
> On Wed, Jun 12, 2013 at 08:03:11PM -0700, Andrew Morton wrote:
> > They all sound like pretty crappy reasons ;) If the idr/ida interface
> > is nasty then it can be wrapped to provide the same interface as the
> > percpu tag allocator.
> >
> > I could understand performance being an issue, but diligence demands
> > that we test that, or at least provide a convincing argument.
>
> The thing is that id[r|a] guarantee that the lowest available slot is
> allocated and this is important because it's used to name things which
> are visible to userland - things like block device minor number,
> device indices and so on. That alone pretty much ensures that
> alloc/free paths can't be very scalable which usually is fine for most
> id[r|a] use cases as long as lookup is fast. I'm doubtful that it's a
> good idea to push per-cpu tag allocation into id[r|a]. The use cases
> are quite different.
>
> In fact, maybe what we can do is add some features on top of the
> tag allocator and moving id[r|a] users which don't require strict
> in-order allocation to it. For example, NFS allocates an ID for each
> transaction it performs and uses it to index the associated command
> structure (Jeff, Bruce, please correct me if I'm getting it wrong).
> The only requirement on IDs is that they shouldn't be recycled too
> fast. Currently, idr implements cyclic mode for it but it can easily
> be replaced with per-cpu tag allocator like this one and it'd be a lot
> more scalable. There are a couple things to worry about tho - it
> probably should use the highbits as generation number as a tag is
> given out so that the actual ID doesn't get recycled quickly, and some
> form of dynamic tag sizing would be nice too.

Yeah, that sounds like a perfect use case.

Using the high bits as a gen number - that's something I've done before
in driver code, and that can be done completely outside the tag
allocator - no need for a cyclic mode.
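
Something like this, roughly - just a sketch, the struct and function
names are all made up, and it assumes 16 bit tags:

/*
 * Sketch only: stash a generation count in the high bits of the id we
 * hand out, entirely outside the tag allocator.
 */
#define TAG_BITS	16
#define TAG_MASK	((1U << TAG_BITS) - 1)

struct tag_gen {
	u16	gen[1 << TAG_BITS];
};

/* call right after the tag allocator hands back @tag */
static u32 tag_to_id(struct tag_gen *tg, unsigned tag)
{
	/* no locking needed - only the current owner of @tag bumps this */
	return (u32) ++tg->gen[tag] << TAG_BITS | tag;
}

/* returns the tag, or -ESTALE if @id has since been recycled */
static int id_to_tag(struct tag_gen *tg, u32 id)
{
	unsigned tag = id & TAG_MASK;

	return tg->gen[tag] == (u16) (id >> TAG_BITS) ? tag : -ESTALE;
}

The allocator itself never sees the generation number - the caller just
wraps alloc and lookup.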

For dynamic sizing, the issue is not so much dynamically sizing the tag
allocator's data structures - the tag allocator itself will use a
fraction of the memory of your tag structs - it's that you want to do
something slightly more intelligent than preallocating one giant array
of tag structs.

I already ran into this in the aio code - kiocbs are just big enough
that we don't want to preallocate them all when we allocate the kioctx.
I did the simplest thing I could think of there, but if other users are
going to run into this as well, maybe it should be made generic.

Anyways, for aio I just use an array of pages for the kiocbs instead of
a flat array, and then the pages are allocated lazily.

http://evilpiepirate.org/git/linux-bcache.git/commit/?h=aio&id=999e7718f6b7ec99512fd576b166e5d63cd45ef2
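
Roughly, the idea is this - just a sketch, not the code from that
commit; names are made up and error paths/freeing are elided:

struct lazy_array {
	size_t		obj_size;	/* <= PAGE_SIZE */
	unsigned	objs_per_page;
	struct page	*pages[];	/* filled in on demand */
};

static void *lazy_array_get(struct lazy_array *la, unsigned idx)
{
	unsigned pg = idx / la->objs_per_page;
	struct page *p = la->pages[pg];

	if (!p) {
		p = alloc_page(GFP_KERNEL|__GFP_ZERO);
		if (!p)
			return NULL;

		/* lost a race? use the winner's page */
		if (cmpxchg(&la->pages[pg], NULL, p) != NULL) {
			__free_page(p);
			p = la->pages[pg];
		}
	}

	return page_address(p) +
		(idx % la->objs_per_page) * la->obj_size;
}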

Since the tag allocator uses stacks, it'll tend to hand out ids that
were only recently freed, so in practice allocations should stay
concentrated in pages we've already got. The one caveat right now is
that if the workload shifts across cpus, tags stranded on percpu
freelists will cause us to allocate pages sooner than we'd like.

I don't think this is a big issue because the tag stealing is done based
on the worst case number of stranded tags - but I think I can improve it
with a bit of laziness...