Re: [00/17] Large Blocksize Support V3

From: Andrew Morton
Date: Sat Apr 28 2007 - 04:56:30 EST

On Sat, 28 Apr 2007 10:32:56 +0200 Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:

> On Sat, 2007-04-28 at 01:22 -0700, Andrew Morton wrote:
> > On Sat, 28 Apr 2007 10:04:08 +0200 Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> >
> > > >
> > > > The other thing is that we can batch up pagecache page insertions for bulk
> > > > writes as well (that is. write(2) with buffer size > page size). I should
> > > > have a patch somewhere for that as well if anyone interested.
> > >
> > > Together with the optimistic locking from my concurrent pagecache that
> > > should bring most of the gains:
> > >
> > > sequential insert of 8388608 items:
> > >
> > >
> > > [ffff81007d7f60c0] insert 0 done in 15286 ms
> > >
> > >
> > > [ffff81006b36e040] insert 0 done in 3443 ms
> > >
> > > only 4.4 times faster, and more scalable, since we don't bounce the
> > > upper level locks around.
> >
> > I'm not sure what we're looking at here. radix-tree changes? Locking
> > changes? Both?
> Both, the radix tree is basically node locked, and all modifying
> operations are taught to node lock their way around the radix tree.
> This, as Christoph pointed out, will suck on SMP because all the node
> locks will bounce around like mad.
> So what I did was couple that with an 'optimistic' RCU lookup of the
> highest node that _needs_ to be locked for the operation to succeed,
> lock that node, verify it is indeed still the node we need, and proceed
> as normal (node locked). This avoid taking most/all of the upper level
> node locks; esp for insert, which typically only needs one lock.

hm. Maybe if we were taking the lock once per N pages rather than once per
page, we could avoid such trickiness.

> > If we have a whole pile of pages to insert then there are obvious gains
> > from not taking the lock once per page (gang insert). But I expect there
> > will also be gains from not walking down the radix tree once per page too:
> > walk all the way down and populate all the way to the end of the node.
> Yes, it will be even faster if we could batch insert a whole leaf node.
> That would save 2^6 - 1 tree traversals and lock/unlock cycles.
> Certainly not unwanted.
> > The implementation could get a bit tricky, handling pages which a racer
> > instantiated when we dropped the lock, and suitably adjusting ->index. Not
> > rocket science though.
> *nod*
> > The depth of the radix tree matters (ie, the file size). 'twould be useful
> > to always describe the tree's size when publishing microbenchmark results
> > like this.
> Right, this was with two such sequences of 2^23, so 2^24 elements in
> total, with 0 offset, which gives: 24/6 = 4 levels.

Where an element would be a page, so we're talking about a 64GB file with
4k pagesize?

Gosh that tree has huge fanout. We fiddled around a bit with
RADIX_TREE_MAP_SHIFT in the early days. Changing it by quite large amounts
didn't appear to have much effect on anything, actually.

btw, regarding __do_page_cache_readahead(): that function is presently
optimised for file rereads - if all pages are present it'll only take the
lock once. The first version of it was optimised the other way around - it
took the lock once for uncached file reads (no pages present), but once per
page for rereads. I made this change because Anton Blanchard had some
workload which hurt, and we figured that rereads are the common case, and
uncached reads are limited by disk speed anyway.

Thinking about it, I guess that was a correct decision, but we did
pessimise these read-a-gargantuan-file cases.

But we could actually implement both flavours and pick the right one to
call based upon the hit-rate metrics which the readahead code is
maintaining (or add new ones).

Of course, gang-populate would be better.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at