RE: [PATCH 7/8] zswap: add to mm/

From: Dan Magenheimer
Date: Thu Jan 03 2013 - 17:34:01 EST

Next message: Tejun Heo: "Re: nfsd oops on Linus' current tree."
Previous message: Rafael J. Wysocki: "Re: [PATCH] PCI / ACPI: Rework ACPI device nodes lookup for the PCI bus type"
In reply to: Seth Jennings: "Re: [PATCH 7/8] zswap: add to mm/"
Next in thread: Seth Jennings: "Re: [PATCH 7/8] zswap: add to mm/"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

> From: Seth Jennings [mailto:sjenning@xxxxxxxxxxxxxxxxxx]
> Subject: Re: [PATCH 7/8] zswap: add to mm/
>
> >> What I'm looking to do is give zswap a little insight into zsmalloc
> >> internals,
>
> I'm putting at lot of thought into how to do it cleanly, while
> maintaining the generic nature of zsmalloc. I did a PoC already, but
> I wasn't comfortable with how much had to be exposed to achieve it.
> So I'm still working on it.

Zbud exposes LRU-ness as well... why not also do that
for zsmalloc? I guess I am asking, as long as you are doing
a "new API" for zsmalloc that zram doesn't benefit from, why not
eliminate the abstraction layer and consider your "zsmalloc2" to
be a captive allocator, like zbud? What's the point of limiting
"how much had to be exposed" unless you are certain that other
kernel-internal-users will benefit?

> > I fear that problem (B) is the fundamental concern with
> > using a high-density storage allocator such as zsmalloc, which
> > is why I abandoned zsmalloc in favor of a more-predictable-but-
> > less-dense allocator (zbud). However, if you have a solution
> > for (B) as well, I would gladly abandon zbud in zcache (for _both_
> > cleancache and frontswap pages) and our respective in-kernel
> > compression efforts would be more easy to merge into one solution
> > in the future.
>
> The only difference afaict between zbud and zsmalloc wrt vacating a
> page frame is that zbud will only ever need to write back two pages to
> swap where zsmalloc might have to write back more to free up the
> zspage that contains the page frame to be vacated. This is doable
> with zsmalloc. The question that remains for me is how cleanly can it
> be done (for either approach).

One other major difference is that zpages in zsmalloc can cross pageframe
boundaries. I don't understand zsmalloc well enough to estimate
how frequently this is true, but it is does complicate the set of
race conditions.

Also, unless I misunderstand, ensuring a pageframe can be freed
will require you to somehow "lock" all the zpages and remove
them atomically. At least, that's what was required for zbud
to avoid races. If this is true, increasing the number of zpages
to be freed would have a significant impact.

You may have a way around this... I'm just cautioning.

> > Hmmm... IIRC, to avoid races in zcache, it was necessary to
> > update both the data (zpage) and meta-data ("tree" in zswap,
> > and tmem-data-structure in zcache) atomically. I will need
> > to study your code more to understand how zswap avoids this
> > requirement. Or if it is obvious to you, I would be grateful
> > if you would point it out to me.
>
> Without the flushing mechanism, a simple lock guarding the tree was
> enough in zswap. The per-entry serialization of the
> store/load/invalidate paths are all handled at a higher level. For
> example, we never get a load and invalidate concurrently on the same
> swap entry.
>
> However, once the flushing code was introduced and could free an entry
> from the zswap_fs_store() path, it became necessary to add a per-entry
> refcount to make sure that the entry isn't freed while another code
> path was operating on it.

Hmmm... doesn't the refcount at least need to be an atomic_t?

Also, how can you "free" any entry of an rbtree while another
thread is walking the rbtree? (Deleting an entry from an rbtree
causes rebalancing... afaik there is no equivalent RCU
implementation for rbtrees... not that RCU would necessarily
work well for this anyway.)

BTW, in case it appears otherwise, I'm trying to be helpful, not
critical. In the end, I think we are in agreement that in-kernel
compression is very important and that the frontswap (and/or
cleancache) interface(s) are the right way to identify compressible
data, and we are mostly arguing allocation and implementation details.

Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Tejun Heo: "Re: nfsd oops on Linus' current tree."
Previous message: Rafael J. Wysocki: "Re: [PATCH] PCI / ACPI: Rework ACPI device nodes lookup for the PCI bus type"
In reply to: Seth Jennings: "Re: [PATCH 7/8] zswap: add to mm/"
Next in thread: Seth Jennings: "Re: [PATCH 7/8] zswap: add to mm/"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]