Re: [00/17] Large Blocksize Support V3
From: Peter Zijlstra
Date: Sun Apr 29 2007 - 09:37:46 EST
On Sat, 2007-04-28 at 01:55 -0700, Andrew Morton wrote:
> On Sat, 28 Apr 2007 10:32:56 +0200 Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> > On Sat, 2007-04-28 at 01:22 -0700, Andrew Morton wrote:
> > > On Sat, 28 Apr 2007 10:04:08 +0200 Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> > >
> > > > >
> > > > > The other thing is that we can batch up pagecache page insertions for bulk
> > > > > writes as well (that is. write(2) with buffer size > page size). I should
> > > > > have a patch somewhere for that as well if anyone interested.
> > > >
> > > > Together with the optimistic locking from my concurrent pagecache that
> > > > should bring most of the gains:
> > > >
> > > > sequential insert of 8388608 items:
> > > >
> > > > CONFIG_RADIX_TREE_CONCURRENT=n
> > > >
> > > > [ffff81007d7f60c0] insert 0 done in 15286 ms
> > > >
> > > > CONFIG_RADIX_TREE_OPTIMISTIC=y
> > > >
> > > > [ffff81006b36e040] insert 0 done in 3443 ms
> > > >
> > > > only 4.4 times faster, and more scalable, since we don't bounce the
> > > > upper level locks around.
> > >
> > > I'm not sure what we're looking at here. radix-tree changes? Locking
> > > changes? Both?
> > Both, the radix tree is basically node locked, and all modifying
> > operations are taught to node lock their way around the radix tree.
> > This, as Christoph pointed out, will suck on SMP because all the node
> > locks will bounce around like mad.
> > So what I did was couple that with an 'optimistic' RCU lookup of the
> > highest node that _needs_ to be locked for the operation to succeed,
> > lock that node, verify it is indeed still the node we need, and proceed
> > as normal (node locked). This avoids taking most/all of the upper level
> > node locks; esp for insert, which typically only needs one lock.
> hm. Maybe if we were taking the lock once per N pages rather than once per
> page, we could avoid such trickiness.
For insertion, maybe, though not all insertion comes with strong
locality. But this also works for the other modifiers, like radix_tree_tag_set(),
which is typically called on single pages.
The whole idea is to entirely remove mapping->tree_lock, and serialize
the pagecache on page level.
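The optimistic scheme described above boils down to: look up the lock target without any lock, take that one lock, then re-check. A toy, stand-alone sketch (all names hypothetical, fanout shrunk to 4, fixed two-level tree; nothing here is the actual patch code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAP_SHIFT 2                     /* tiny fanout, demo only */
#define MAP_SIZE  (1 << MAP_SHIFT)

struct node {
    pthread_mutex_t lock;
    void *slots[MAP_SIZE];
};

struct tree {
    struct node *root;                  /* fixed two-level tree */
};

static struct node *node_alloc(void)
{
    struct node *n = calloc(1, sizeof(*n));
    pthread_mutex_init(&n->lock, NULL);
    return n;
}

/* Optimistic phase: with no lock held, find the deepest node this
 * insert can possibly modify.  If the leaf already exists, only its
 * lock is needed; otherwise the root must be held to link a new leaf. */
static struct node *lock_target(struct tree *t, unsigned idx)
{
    struct node *leaf = t->root->slots[idx >> MAP_SHIFT];
    return leaf ? leaf : t->root;
}

static void insert(struct tree *t, unsigned idx, void *item)
{
    struct node *n;

    for (;;) {
        n = lock_target(t, idx);        /* lockless lookup */
        pthread_mutex_lock(&n->lock);
        if (n == lock_target(t, idx))   /* verify it is still the one */
            break;
        pthread_mutex_unlock(&n->lock); /* raced with a tree change */
    }
    if (n == t->root) {                 /* tree needs extending */
        struct node *leaf = node_alloc();
        leaf->slots[idx & (MAP_SIZE - 1)] = item;
        n->slots[idx >> MAP_SHIFT] = leaf;
    } else {
        n->slots[idx & (MAP_SIZE - 1)] = item;
    }
    pthread_mutex_unlock(&n->lock);
}

static void *lookup(struct tree *t, unsigned idx)
{
    struct node *leaf = t->root->slots[idx >> MAP_SHIFT];
    return leaf ? leaf->slots[idx & (MAP_SIZE - 1)] : NULL;
}
```

The point of the retry loop is that for the common insert (leaf already present) only one leaf lock is taken, never the root's.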
> > > If we have a whole pile of pages to insert then there are obvious gains
> > > from not taking the lock once per page (gang insert). But I expect there
> > > will also be gains from not walking down the radix tree once per page too:
> > > walk all the way down and populate all the way to the end of the node.
> > Yes, it would be even faster if we could batch insert a whole leaf node.
> > That would save 2^6 - 1 tree traversals and lock/unlock cycles.
> > Certainly not unwanted.
> > > The implementation could get a bit tricky, handling pages which a racer
> > > instantiated when we dropped the lock, and suitably adjusting ->index. Not
> > > rocket science though.
> > *nod*
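One plausible shape for that gang insert (a flat array stands in for the radix tree, names are hypothetical, and real code would also have to decide what to do with the racer's page rather than just skip the slot):

```c
#include <assert.h>
#include <pthread.h>

#define CACHE_SLOTS 64

struct page {
    unsigned long index;
};

struct mapping {
    pthread_mutex_t tree_lock;
    struct page *slots[CACHE_SLOTS];    /* stand-in for the radix tree */
};

/* Insert nr pages starting at index start, taking tree_lock once for
 * the whole batch.  Slots a racer instantiated while we were unlocked
 * are skipped, and each page that does land gets ->index adjusted to
 * the slot it actually occupies.  Returns the number inserted. */
static int gang_insert(struct mapping *m, unsigned long start,
                       struct page **pages, int nr)
{
    unsigned long idx;
    int inserted = 0;

    pthread_mutex_lock(&m->tree_lock);
    for (idx = start; idx < CACHE_SLOTS && inserted < nr; idx++) {
        if (m->slots[idx])
            continue;                   /* a racer beat us to this slot */
        pages[inserted]->index = idx;   /* adjust ->index accordingly */
        m->slots[idx] = pages[inserted];
        inserted++;
    }
    pthread_mutex_unlock(&m->tree_lock);
    return inserted;
}
```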
> > > The depth of the radix tree matters (ie, the file size). 'twould be useful
> > > to always describe the tree's size when publishing microbenchmark results
> > > like this.
> > Right, this was with two such sequences of 2^23, so 2^24 elements in
> > total, with 0 offset, which gives: 24/6 = 4 levels.
> Where an element would be a page, so we're talking about a 64GB file with
> 4k pagesize?
> Gosh that tree has huge fanout. We fiddled around a bit with
> RADIX_TREE_MAP_SHIFT in the early days. Changing it by quite large amounts
> didn't appear to have much effect on anything, actually.
It might make sense to decrease it a bit with this work, it gives more
opportunity to parallelize.
Currently the tree_lock is pretty much the most expensive operation in
this area; it happily bounces around the machine.
The lockless work showed there is lots of room for improvement here.
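For reference, the depth and file-size arithmetic quoted above can be checked with a trivial stand-alone helper (not kernel code; only RADIX_TREE_MAP_SHIFT is a real mainline name):

```c
#include <assert.h>

/* The mainline fanout: 2^6 = 64 slots per node. */
#define RADIX_TREE_MAP_SHIFT 6

/* Number of tree levels needed to cover nr_items consecutive
 * indices starting at 0. */
static int radix_depth(unsigned long long nr_items)
{
    int depth = 0;
    unsigned long long covered = 1;

    while (covered < nr_items) {
        covered <<= RADIX_TREE_MAP_SHIFT;
        depth++;
    }
    return depth;
}
```

With 2^24 items that gives 24/6 = 4 levels, and 2^24 pages of 4k each is indeed a 64GB file.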
> btw, regarding __do_page_cache_readahead(): that function is presently
> optimised for file rereads - if all pages are present it'll only take the
> lock once. The first version of it was optimised the other way around - it
> took the lock once for uncached file reads (no pages present), but once per
> page for rereads. I made this change because Anton Blanchard had some
> workload which hurt, and we figured that rereads are the common case, and
> uncached reads are limited by disk speed anyway.
> Thinking about it, I guess that was a correct decision, but we did
> pessimise these read-a-gargantuan-file cases.
> But we could actually implement both flavours and pick the right one to
> call based upon the hit-rate metrics which the readahead code is
> maintaining (or add new ones).
> Of course, gang-populate would be better.
I could try to do a gang insert operation on top of my concurrent
pagecache work.
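Picking the flavour from the readahead hit-rate metrics could be as dumb as the sketch below. The struct, field names and the 50% threshold are all made up for illustration; only the two strategies are from the discussion above.

```c
#include <assert.h>

/* Hypothetical running estimate of how often readahead finds pages
 * already present in the pagecache. */
struct ra_stats {
    unsigned long cache_hits;           /* pages found already present */
    unsigned long pages_checked;
};

enum ra_flavour {
    RA_OPT_REREAD,                      /* lock once when pages are present */
    RA_OPT_UNCACHED,                    /* lock once when pages are absent */
};

static enum ra_flavour pick_flavour(const struct ra_stats *s)
{
    if (s->pages_checked == 0)
        return RA_OPT_REREAD;           /* no history: rereads are common */
    if (s->cache_hits * 2 >= s->pages_checked)
        return RA_OPT_REREAD;           /* mostly cached: reread flavour */
    return RA_OPT_UNCACHED;             /* mostly cold: uncached flavour */
}
```

A gang-populate path would of course subsume the uncached flavour entirely.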