Re: [patch 31/35] fs: icache per-zone inode LRU

From: Dave Chinner
Date: Wed Oct 20 2010 - 06:19:26 EST


On Wed, Oct 20, 2010 at 02:20:24PM +1100, Nick Piggin wrote:
> On Wed, Oct 20, 2010 at 12:14:32PM +0900, KOSAKI Motohiro wrote:
> > > On Tue, Oct 19, 2010 at 02:42:47PM +1100, npiggin@xxxxxxxxx wrote:
> > > Anyway, my main point is that tying the LRU and shrinker scaling to
> > > the implementation of the VM is a one-off solution that doesn't work
> > > for generic infrastructure. Other subsystems need the same
> > > large-machine scaling treatment, and there's no way we should be
> > > tying them all into the struct zone. It needs further abstraction.
> >
> > I'm not sure what data structure is best. I can only say that the
> > current zone-unaware slab shrinker can cause the following sad
> > scenarios:
> >
> > o A DMA zone shortage invokes reclaim, and plenty of icache in the
> >   NORMAL zone gets dropped
> > o A NUMA-aware system enables zone_reclaim_mode, but shrink_slab()
> >   still drops unrelated zones' icache
> >
> > Both cause performance degradation. In other words, Linux does not
> > have a flat memory model, so I don't think Nick's basic concept is
> > wrong. It's a straightforward enhancement, but if it doesn't fit the
> > current shrinkers, I'd like to discuss how to make a better data
> > structure.
> >
> >
> >
> > And I have a dumb question (sorry, I don't know XFS at all). The
> > current xfs_mount is below:
> >
> > typedef struct xfs_mount {
> > 	...
> > 	struct shrinker	m_inode_shrink;	/* inode reclaim shrinker */
> > } xfs_mount_t;
> >
> >
> > Do you mean XFS can't convert its shrinker to shrinker[ZONES]? If so, why?
>
> Well, if XFS were to use per-zone shrinkers, it would keep a single
> shrinker context per sb, as it has now, but divide its object
> management into per-zone structures.
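
For concreteness, that amounts to something like the following: one
shrinker registration, but the objects split into per-zone LRUs so
the callback only scans the zone under pressure. The names and the
callback signature here are illustrative only, not the actual patch
interface:

	struct foo_zone_lru {
		spinlock_t		lock;
		struct list_head	lru;
		long			nr_items;
	};

	struct foo_cache {
		struct shrinker		shrinker;	/* one per sb */
		struct foo_zone_lru	*zone_lru;	/* one per zone */
	};

	static int foo_shrink(struct shrinker *shrink, struct zone *zone,
			      int nr_to_scan, gfp_t gfp_mask)
	{
		struct foo_cache *cache =
			container_of(shrink, struct foo_cache, shrinker);
		/*
		 * foo_lru_for_zone() is a hypothetical helper mapping
		 * @zone to this cache's matching per-zone LRU.
		 */
		struct foo_zone_lru *lru = foo_lru_for_zone(cache, zone);

		/* scan only this zone's LRU; other zones are untouched */
		return foo_scan_lru(lru, nr_to_scan);	/* hypothetical */
	}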

<sigh>

I don't think anyone wants per-AG x per-zone reclaim lists on a
1024-node machine with a 1,000-AG (1PB) filesystem - that's on the
order of a million reclaim lists.

As I have already said, the XFS inode caches are optimised in
structure to minimise IO and maximise internal filesystem
parallelism. They are not optimised for per-cpu or NUMA scalability
because if you don't have filesystem level parallelism, you can't
scale to large numbers of concurrent operations across large numbers
of CPUs in the first place.

In the case of XFS, the per-allocation-group structure is how we
scale internal parallelism: as long as you have more AGs than you
have CPUs, there is very good per-CPU scalability through the
filesystem because most operations are isolated to a single AG. That
is how we scale parallelism in XFS, and it has proven to scale pretty
well on even the largest of NUMA machines.
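
The relevant per-AG structure looks roughly like this (trimmed down
to just the incore inode cache fields):

	typedef struct xfs_perag {
		...
		rwlock_t		pag_ici_lock;	/* incore inode lock */
		struct radix_tree_root	pag_ici_root;	/* incore inode root */
		...
	} xfs_perag_t;

Operations on different AGs touch different radix trees under
different locks, so parallelism scales with the number of AGs rather
than with anything the VM knows about.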

This is what I mean about there being an impedance mismatch between
the way the VM and the VFS/filesystem caches scale. Fundamentally,
the way filesystems want their caches to operate for optimal
performance can be vastly different to the way you want shrinkers to
operate for VM scalability. Forcing the MM way of doing stuff down
into the LRUs and shrinkers is not a good way of solving this
problem.

> For subsystems that aren't important, don't take much memory or have
> much reclaim throughput, they are free to ignore the zone argument
> and keep using the global input to the shrinker.
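
In other words, such a subsystem's callback would degenerate to
something like this (illustrative names):

	static int bar_shrink(struct shrinker *shrink, struct zone *zone,
			      int nr_to_scan, gfp_t gfp_mask)
	{
		/*
		 * @zone deliberately ignored: one global LRU, one
		 * global lock, exactly as before.
		 */
		return bar_scan_global_lru(nr_to_scan);	/* hypothetical */
	}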

Having a global lock in a shrinker is already a major point of
contention because shrinkers have unbound parallelism. Hence all
shrinkers need to be converted to use scalable structures. What we
need _first_ is the infrastructure to do this in a sane manner, not
tie a couple of shrinkers tightly into the mm structures and then
walk away.

And FWIW, most subsystems that use shrinkers can be built as modules
or left out of the kernel entirely. Converting them to your current
method will probably leave #ifdef CONFIG_ crap all through the
struct zone definition....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx