Re: [00/17] Large Blocksize Support V3

From: Peter Zijlstra
Date: Sat Apr 28 2007 - 04:33:53 EST

On Sat, 2007-04-28 at 01:22 -0700, Andrew Morton wrote:
> On Sat, 28 Apr 2007 10:04:08 +0200 Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> > >
> > > The other thing is that we can batch up pagecache page insertions for bulk
> > > writes as well (that is. write(2) with buffer size > page size). I should
> > > have a patch somewhere for that as well if anyone interested.
> >
> > Together with the optimistic locking from my concurrent pagecache that
> > should bring most of the gains:
> >
> > sequential insert of 8388608 items:
> >
> >
> > [ffff81007d7f60c0] insert 0 done in 15286 ms
> >
> >
> > [ffff81006b36e040] insert 0 done in 3443 ms
> >
> > only 4.4 times faster, and more scalable, since we don't bounce the
> > upper level locks around.
> I'm not sure what we're looking at here. radix-tree changes? Locking
> changes? Both?

Both, the radix tree is basically node locked, and all modifying
operations are taught to node lock their way around the radix tree.
This, as Christoph pointed out, will suck on SMP because all the node
locks will bounce around like mad.

So what I did was couple that with an 'optimistic' RCU lookup of the
highest node that _needs_ to be locked for the operation to succeed,
lock that node, verify it is indeed still the node we need, and proceed
as normal (node locked). This avoid taking most/all of the upper level
node locks; esp for insert, which typically only needs one lock.

> If we have a whole pile of pages to insert then there are obvious gains
> from not taking the lock once per page (gang insert). But I expect there
> will also be gains from not walking down the radix tree once per page too:
> walk all the way down and populate all the way to the end of the node.

Yes, it will be even faster if we could batch insert a whole leaf node.
That would save 2^6 - 1 tree traversals and lock/unlock cycles.
Certainly not unwanted.

> The implementation could get a bit tricky, handling pages which a racer
> instantiated when we dropped the lock, and suitably adjusting ->index. Not
> rocket science though.


> The depth of the radix tree matters (ie, the file size). 'twould be useful
> to always describe the tree's size when publishing microbenchmark results
> like this.

Right, this was with two such sequences of 2^23, so 2^24 elements in
total, with 0 offset, which gives: 24/6 = 4 levels.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at