Re: [PATCH 7/8] zswap: add to mm/

From: Dave Chinner
Date: Thu Jan 03 2013 - 02:33:32 EST

Next message: Dmitry Torokhov: "Re: [PATCH 1/2] timer: vt8500: Move system timer to clocksource"
Previous message: Martin Schwidefsky: "Re: linux-next: manual merge of the kvm tree with the s390 tree"
In reply to: Dan Magenheimer: "RE: [PATCH 7/8] zswap: add to mm/"
Next in thread: Dan Magenheimer: "RE: [PATCH 7/8] zswap: add to mm/"

On Wed, Jan 02, 2013 at 11:04:24AM -0800, Dan Magenheimer wrote:
> > From: Dave Hansen [mailto:dave@xxxxxxxxxxxxxxxxxx]
> > Subject: Re: [PATCH 7/8] zswap: add to mm/
>
> Hi Dave --
>
> I suspect we are in violent agreement but just to make sure...
>
> Although zswap is the current example, I guess I am discussing
> a bigger issue, which IMHO is much more important: How should
> compression be utilized in the kernel (if at all)? Zswap is
> simply one implementation of in-kernel compression (handling
> anonymous pages only) and zcache is another (handling both
> anonymous pages and pagecache pages). Each has some
> limited policy, and policy defaults built-in, but neither IMHO
> is adequately aware of (let alone integrated with) MM policy to
> be useful to a broad set of end users and to be enabled by default
> by generic distros.
>
> > On 01/02/2013 09:26 AM, Dan Magenheimer wrote:
> > > However if one compares the total percentage
> > > of RAM used for zpages by zswap vs the total percentage of RAM
> > > used by slab, I suspect that the zswap number will dominate,
> > > perhaps because zswap is storing primarily data and slab is
> > > storing primarily metadata?
> >
> > That's *obviously* 100% dependent on how you configure zswap. But, that
> > said, most of _my_ systems tend to sit with about 5% of memory in
> > reclaimable slab
>
> The 5% "sitting" number for slab is somewhat interesting, but
> IMHO irrelevant here. The really interesting value is what percent
> is used by slab when the system is under high memory pressure; I'd
> imagine that number would be much smaller. True?

Not at all. The amount of slab memory used is wholly dependent on
workload. I have plenty of workloads with severe memory pressure
that I test with that sit at a steady state of >80% of RAM in slab
caches. These workloads are filesystem metadata intensive rather than
data intensive, and that's exactly the right cache balance for the
system to have....

Thinking that there is a fixed amount of memory that you should
reserve for some subsystem is simply the wrong approach to take.
Caches are dynamic, and the correct system balance should result from
the natural behaviour of the reclaim algorithms.

The shrinker infrastructure doesn't set any fixed size goals - it
simply tries to balance the reclaim across all the shrinkers and
relative to the page cache. If a cache is under allocation
pressure, then it will grow to the point that reclaim is balanced
with the allocation pressure, and it won't grow any further. If the
allocation pressure drops, then the cache will shrink if overall
memory pressure is maintained.....
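
A toy model may make this balance concrete. The sketch below is not
the kernel's shrinker code. It assumes, purely for illustration, that
each step adds a fixed number of objects (alloc_rate) and that reclaim
then removes a fixed fraction (reclaim_fraction) of whatever is cached.
Both names are hypothetical. Under that assumption the cache settles
at alloc_rate * (1 - reclaim_fraction) / reclaim_fraction without
anyone configuring a size limit.

```python
def simulate_cache(alloc_rate, reclaim_fraction, steps=200, start=0.0):
    """Toy model of one cache under constant memory pressure.

    Each step the cache grows by alloc_rate objects, then reclaim
    removes reclaim_fraction of the cached objects. This is an
    illustrative assumption, not the kernel's reclaim algorithm.
    """
    size = start
    history = []
    for _ in range(steps):
        size += alloc_rate              # allocation pressure
        size -= size * reclaim_fraction  # proportional reclaim
        history.append(size)
    return history


if __name__ == "__main__":
    busy = simulate_cache(alloc_rate=100, reclaim_fraction=0.1)
    # Same reclaim pressure, but allocation pressure has dropped.
    quiet = simulate_cache(alloc_rate=10, reclaim_fraction=0.1, start=busy[-1])
    print("steady state under load: %.1f" % busy[-1])
    print("steady state after load drops: %.1f" % quiet[-1])
```

In this model the steady-state size follows the allocation pressure:
lowering alloc_rate while keeping reclaim_fraction the same shrinks
the cache, which is the behaviour described above.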

> > > I don't claim to be any kind of expert here, but I'd imagine
> > > that MM doesn't try to manage the total amount of slab space
> > > because slab is "a cost of doing business".

From the above it should be obvious that the MM subsystem really
does manage the total amount of slab space being used....

> > > However, for
> > > in-kernel compression to be widely useful, IMHO it will be
> > > critical for MM to somehow load balance between total pageframes
> > > used for compressed pages vs total pageframes used for
> > > normal pages, just as today it needs to balance between
> > > active and inactive pages.
> >
> > The issue isn't about balancing. It's about reclaim where the VM only
> > cares about whole pages. If our subsystem (zwhatever or slab) is only
> > designed to reclaim _parts_ of pages, can we be successful in returning
> > whole pages to the VM?
>
> IMHO, it's about *both* balancing _and_ reclaim. One remaining
> major point of debate between zcache and zswap is that zcache
> accepts lower density to ensure that whole pages can be easily
> returned to the VM (and thus allow balancing) while zswap targets
> best density (by using zsmalloc) and doesn't address returning
> whole pages to the VM.

And so the two subsystems need different reclaim implementations.
And, well, that's exactly what we have shrinkers for - implementing
subsystem specific reclaim policy. The shrinker infrastructure is
responsible for keeping the balance between all the caches that
have shrinkers and the size of the page cache...
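
To illustrate why the two designs would want different reclaim
policies, here is a small sketch of the "parts of pages vs whole
pages" question raised in the quoted text. It is a simplified model
under stated assumptions: every page holds the same number of objects,
and an object-level policy frees objects in an order unrelated to the
page they live on (modelled here as a seeded shuffle). The function
name and parameters are hypothetical.

```python
import random


def whole_pages_freed(num_pages, objs_per_page, to_free, page_aware, seed=0):
    """Count pages that end up completely empty after freeing objects.

    page_aware=True  frees objects page by page (a page-oriented policy).
    page_aware=False frees objects in shuffled order (an assumed stand-in
    for an object-level policy that ignores page boundaries).
    """
    objects = [(page, slot)
               for page in range(num_pages)
               for slot in range(objs_per_page)]
    if not page_aware:
        random.Random(seed).shuffle(objects)

    live = [objs_per_page] * num_pages
    for page, _slot in objects[:to_free]:
        live[page] -= 1
    # Only pages with no live objects can be handed back to the VM.
    return sum(1 for count in live if count == 0)


if __name__ == "__main__":
    for aware in (True, False):
        freed = whole_pages_freed(100, 4, 200, page_aware=aware)
        print("page_aware=%s: %d whole pages freed" % (aware, freed))
```

Freeing the same number of objects can return very different numbers
of whole pages depending on the order, which is the kind of
subsystem-specific knowledge a shrinker implementation can encode.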

> > The slab shrinkers only work on parts of pages (singular slab objects).
> > Yet, it does appear that they function well enough when we try to
> > reclaim from them. I've never seen a slab's sizes spiral out of control
> > due to fragmentation.
>
> Perhaps this is because the reclaimable slab objects are mostly
> metadata which is highly connected to reclaimable data objects?
> E.g. reclaiming most reclaimable data pages also coincidentally
> reclaims most slab objects?

No, that's not true. Caches can have some very complex
hierarchies with dependencies across multiple slabs and shrinkers,
not to mention that the caches don't even need to be related to
filesystems or the page cache. Indeed, look at the shrinkers
attached to the memory pools used by the acceleration engines for
graphics hardware...

There are also cases where we've moved metadata caches out of the
page cache into shrinker controlled caches because the page cache
reclaim is too simplistic to handle the complex relationships
between filesystem metadata. We've done this in XFS, and IIRC btrfs
did this recently as well...

> (Also, it is not the slab size that would be the issue here but
> its density... i.e. if, after shrinking, 1000 pageframes contain
> only 2000 various 4-byte objects, that would be "out of control".
> Is there any easy visibility into slab density?)

/proc/slabinfo via slabtop, perhaps?

 Active / Total Objects (% used)    : 1798915 / 1913060 (94.0%)
 Active / Total Slabs (% used)      : 238160 / 238169 (100.0%)
 Active / Total Caches (% used)     : 119 / 203 (58.6%)
 Active / Total Size (% used)       : 843529.59K / 859896.40K (98.1%)
 Minimum / Average / Maximum Object : 0.02K / 0.45K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
689872 688486  99%    0.88K 172468        4    689872K xfs_inode
294320 288968  98%    0.19K  14716       20     58864K dentry
212380 173765  81%    0.10K   5740       37     22960K buffer_head
192576 179537  93%    0.06K   3264       59     13056K size-64
 82705  82248  99%    0.22K   4865       17     19460K xfs_ili
 48180  43787  90%    0.31K   4015       12     16060K xfs_buf
 47712  40480  84%    0.03K    426      112      1704K size-32
 44572  38447  86%    0.17K   2026       22      8104K vm_area_struct
.....
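
For a per-cache density figure rather than slabtop's summary, a rough
sketch along the following lines could work. It assumes the
"slabinfo - version: 2.1" line layout (name, active_objs, num_objs,
objsize, objperslab, pagesperslab, then tunables and slabdata) and a
4096-byte page; the sample line and the cache name "example_cache" are
made up for illustration and are not taken from the output above.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def slab_density(active_objs, obj_size, num_pages, page_size=PAGE_SIZE):
    """Fraction of the cache's page frames occupied by live objects."""
    return active_objs * obj_size / (num_pages * page_size)


def density_from_slabinfo_line(line, page_size=PAGE_SIZE):
    """Parse one data line in the assumed slabinfo 2.1 layout:

    name active_objs num_objs objsize objperslab pagesperslab
      : tunables ... : slabdata active_slabs num_slabs sharedavail
    """
    head, _tunables, slabdata = line.split(":")
    name, active_objs, _num_objs, objsize, _per_slab, pages_per_slab = head.split()
    num_slabs = int(slabdata.split()[2])
    num_pages = num_slabs * int(pages_per_slab)
    return name, slab_density(int(active_objs), int(objsize), num_pages, page_size)


if __name__ == "__main__":
    # The case from the quoted question: 2000 4-byte objects in 1000 pages.
    print("%.4f%%" % (100 * slab_density(2000, 4, 1000)))

    # Hypothetical sample line, not real data.
    sample = "example_cache 900 1000 512 8 1 : tunables 0 0 0 : slabdata 125 125 0"
    print(density_from_slabinfo_line(sample))
```

Note that this measures bytes of live objects per page frame, which
is a stricter figure than the USE column (active objects over total
object slots).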

> In any case, I would posit that both the nature of zpages and their
> average size relative to a whole page is quite unusual compared to slab.

Doesn't sound at all unusual.

> So while there may be some useful comparisons between zswap
> and slab, the differences may warrant dramatically different policy.

There may be differences, but it doesn't sound like there's anything
you can't implement with an appropriate shrinker implementation....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx