Re: [GIT PULL] mm: frontswap (for 3.2 window)
From: Andrea Arcangeli
Date: Tue Nov 01 2011 - 21:36:25 EST
Hi Dan.
On Tue, Nov 01, 2011 at 02:00:34PM -0700, Dan Magenheimer wrote:
> Pardon me for complaining about my typing fingers, but it seems
> like you are making statements and asking questions as if you
> are not reading the whole reply before you start responding
> to the first parts. So it's going to be hard to answer each
> sub-thread in order. So let me hit a couple of the high
> points first.
I'm actually reading all your reply, if I skip some part it may be
because the email is too long already :). I'm just trying to
understand it and I wish I had more time to dedicate to this too but
I've other pending stuff too.
> Let me emphasize and repeat: Many of your comments
> here are addressing zcache, which is a staging driver.
> You are commenting on intra-zcache APIs only, _not_ on the
> tmem ABI. I realize there is some potential confusion here
> since the file in the zcache directory is called tmem.c;
> but it is NOT defining or implementing the tmem ABI/API used
So where exactly is the tmem API if it's not in tmem.c in
staging/zcache? I read what is in the kernel and I comment on it.
> by the kernel. The ONLY kernel API that need be debated here
> is the code in the frontswap patchset, which provides
That code is calling into zcache that calls into tmem.. so I'm not
sure how you can pretend we focus only in forntswap. Also I don't care
if zcache is already merged in staging, I may still like to see
changes happening there.
> registration for a set of function pointers (see the
> struct frontswap_ops in frontswap.h in the patch) and
> provides the function calls (API) between the frontswap
> (and cleancache) "frontends" and the "backends" in
> the driver directory. The zcache file "tmem.c" is simply
> a very early attempt to tease out core operations and
> data structures that are likely to be common to multiple
> tmem users.
It's a little hard to follow, so that tmem.c is not the real tmem and
tries to mirror the real tmem.c that people will really use trying to
be compatible when the real tmem.c will be used instead?
It looks even more weird than I thought, why isn't the real tmem.c in
the zcache directory instead of an attempt to tease out core
operations and data structures? Maybe I just misunderstood what tmem.c
is about in zcache.
> Everything in zcache (including tmem.c) is completely open
> to evolution as needed (by KVM or other users) and this
> will need to happen before zcache is promoted out of staging.
> So your comments will be very useful when "we" work
> on that promotion process.
Again I don't care frontswap out-of-tree, zcache-in-tree
tmem-of-zcache-not-real-tmem-in-tree, I will comment on any of
those. I understand your focus is to get frontswap merged, but on my
side to review frontswap I am forced to read zcache and if I don't
like something there, it's unlikely I can like the _caller_ of zcache
code either (i.e. frontswap_put).
I don't see this as a showstopper on your side, if you agree why don't
you start fixing what is _in_tree_ first, and then submit frontswap
again?
I mean if you proof what you're pushing in (and what you already
pushed in staging) is the way to go, you shouldn't be so worried about
frontswap being merged immediately. I think the merge of frontswap
it's going to work much better if you convince the whole VM camp what
is in tree (zcache/tmem.c) is the way to go, then there won't be much
opposition to merge frontswap and make the core VM add hooks for
something already proven _worthwhile_.
I believe if all the reviewers and commenters would think the zcache
directory is the way to go to store swapcache before it hits swap, you
wouldn't have much problem to add changes to the core VM to put
swapcache into it.
But when things gets merged they never go out of tree, or rarely go
out of tree, and the maintenance is then upon us and the whole
community. So before adding dependencies on the core VM to
zcache/tmem.c it'd be nicer to be sure it's the way to go...
I hope this explains why I am "forced" to look into tmem.c while
commenting on frontswap.
> So, I'm going to attempt to ignore the portions of your
> reply that are commenting specifically about zcache coding
> issues and reply to the parts that potentially affect
> acceptance of the frontswap patchset, but if I miss anything
> important to you, please let me know.
So you call the tmem_put getting a char a not a page "zcache coding",
but to me it's not even clear if tmem would equally be happy to use a
page structure. Would that remain compatible with what you call above
"multiple tmem users" or not?
It's a little hard to see where Xen starts and where the kernel ends
here. Can Xen make any use of the kernel code you pushed in staging
yet? Where does the Xen API starts there? I'd like to compare to the
real tmem.c if zcache/tmem.c isn't it.
And no I don't imply the cast of the page to char is a problem at all,
I assume you're perfectly right that it's a coding issue, and may be
very well fixable with a few liner fix, but then why not proof the
point and fix it, instead of keep insisting we review frontswap as a
standalone entity like if it wasn't calling the code I am
commenting? And deferring the (albeit minor) fixage of tmem API for
later after we already are calling tmem_put from the core VM?
Or do you disagree with my proposed changes? I don't think it's
unreasonable to ask you to cleanup and make more proper what is in
tree already before adding more "stuff" that depends on it and would
have to be maintained _forever_ in the VM?
I don't actually ask perfection, but it'd be easier if you cleaned up
what looks "fishy".
> Other than the fact that cleancache_put is called with
> irqs disabled (and, IIRC, sometimes cleancache_flush?)
> and the coding complications that causes, you are correct.
>
> Preemption does need to be disabled though and, IIRC,
> in some cases, softirqs.
What code runs in softirqs that could race with zcache+frontswap?
BTW, I wonder if the tree_lock is clearing irqs only to avoid getting
interrupted during the critical section as a performance optimization
(normally __delete_from_swap_cache would be way faster than a
page-compression but with page compression in, unless there's real
code running from irqs it's unlikely we want to insist with irqs
disabled, that only is a good optimization for fast code, delaying
irqs 20 times more than normal isn't so good, would be better if those
hooks run outside of the tree_lock).
> Possible I suppose. But unless you can teach bio to deal
> with dynamically time-and-size varying devices, you are
> not implementing the most important value of the tmem concept,
> you are just re-implementing zram. And, as I said, Nitin
> supports frontswap because it is better than zram for
> exactly this (dynamicity) reason: https://lkml.org/lkml/2011/10/28/8
I don't think the bio should deal with that, the bios can write at
hardblocksize granularity, it should be up to an upper layer to find
stuff to compact into tmem and write it out in a compact way, in a way
that the "cookie" returned to swapcache code can still read from it
when tmem is asked a .get operation.
BTW, this swapcache_put/get is a bit misleading with
get_page/put_page, maybe tmem_store/tmem_load are more appropriate
names for the API. Normally we call get on a page when we take a
refcount on it and put when we release it.
> Just like all existing memory management code, zcache depends
> on some heuristics, which can be improved as necessary over time.
> Some of the metrics that feed into the heuristics are in
> debugfs so they can be manipulated as zcache continue to
> develop. (See zv_page_count_policy_percent for example...
> yes this is still a bit primitive. And before you start up
> about dynamic sizing, this is only a maximum.)
I wonder how can you tune it without any direct feedback from the VM
pressure, VM pressure changes too fast to poll it. The zcache pool
should shrink fast and be pushed into real disk (ideally without
requiring decompression and regular swapout but by compacting it and
writing it out in a compact way), if there's mlocked memory growing for
example.
> For Xen, there is a "memmax" for each guest, and Xen tmem disallows
> a guest from using a page for swap (tmem calls it a "persistent" pool)
> if it has reached its memmax. Thus unless a tmem-enabled guest
> is "giving", it can never expect to "get".
>
> For KVM, you can overcommit in the host, so you could choose a
> different heuristic... if you are willing to accept host swapping
> (which I think is evil :-)
Well host swapping with zcache working on host is going to be
theoretically (modulo implementation issues) faster than anything else
because it won't run into any vmexits. Swapping in guest is forced to
go through vmexits to page out to disk (and now even tmem 4k calls
apparently). Plus we keep the whole system balance in the host VM
without having to try to mix what has been collected by each guest VM.
It's like comparing host I/O vs guest I/O, host I/O is always faster
by definition and we have host swapping fully functional, the mmu
notifier overhead is nothing compared to the regular pte tlb flushing
and VM overhead (that you have to pay in guest anyway to overcommit in
guest).
> Well, you have yet to convince me that an extra copy is
> so damning, especially on a modern many-core CPU where it
> can be done in 256 cycles and especially when the cache-pollution
> for the copy is necessary for the subsequent compression anyway.
I thought we agreed the extra copy (like highmem bounce buffers) that
destroys cpu caches was the worst possible thing, not sure why you
bring it back.
To remove the disable of irqs you've just to check which code could
run from irq, and it's a bit hard to see what... maybe it's fixable.
> Think of zcache as zram PLUS dynamicity PLUS ability to dynamically
> trade off memory utilization against compressed page cache.
It's hard to see how you can obtain dynamicity and no risk of going
OOM prematurely if suddenly if a mlockall() program tries to allocate
all RAM in the system. You need some more hooks in the VM than the one
you have today, and that applies to cleancache too I think.
It starts to look a big hack that works for VM and can fall apart if
exposed to the wrong workload that uses mlockall or similar. I
understand it's corner cases but we can't add cruft to the VM.
You've no idea how many times I hear of people adding hooks here and
there, last time it was to make mremap run with a binary only module
and move 2M pages allocated at boot not visible to the VM. There is no
way we can add hooks all over the place every time somebody invents
something that helps a specific workload. What we had must work 100%
for everything. So mlockall() using all RAM and triggering an OOM with
zcache+cleancache enabled, it's not ok in my view.
I think it can be fixed so I don't mean it's not good but somebody
should work in fixing it, not just leaving the code unchanged and keep
pushing. I mean this thing looks more complicated than it is in the
current implementation if it's claimed to be fully dynamic and it
looks like it can backfire on the wrong workload.
> > And what is the exact reason of the local_irq_save for doing it
> > zerocopy?
>
> (Answered above I think? If not, let me know.)
No, you didn't actually tell the exact line of code that runs from irq
and that races with the code, and that requires to disable irq. I
still have no clue why irqs must be disabled. You now mentioned
sofitrqs, what is the code running in softirqs that requires disabling
irqs.
> > Would I'd like is a mechanism where you:
> >
> > 1) add swapcache to zcache (with fallback to swap immediately if zcache
> > allocation fails)
>
> Current swap code pre-selects the swap device several layers
> higher in the call chain, so this requires fairly major surgery
> on the swap subsystem... and the long bug-tail that implies.
Well you can still release the swapcache once tmem_put (better
tmem_store) succeeds. Then it's up to the zcache layer to allocate
more swap entries and store it in the swap in a compact way.
> > 2) when some threshold is hit or zcache allocation fails, we write the
> > compressed data in a compact way to swap (freeing zcache memory),
> > or swapcache directly to swap if no zcache is present
>
> Has efficient writing (and reading) of smaller-than-page chunks
> through blkio every been implemented? I know compression can be
> done "behind the curtain" of many I/O devices, but am unaware
> that the same functionality exists in the kernel. If it doesn't
> exist, this requires fairly major surgery on the blkio subsystem.
> If it does exist, I doubt the swap subsystem is capable of using
> it without major surgery.
Again I don't think compacting is the task of the I/O subsystem. Quite
obviously not. Even reiser3 writes to disk the tails compacted and
surely doesn't require changes to the storage layer. The algorithm
belongs to tmem or whatever abstraction where we stored the swapcache.
> Yeah, I agree that sounds like a cool high-level design for a
> swap subsystem rewrite. Problem is it doesn't replace the dynamicity
> to do what frontswap does for virtualization and multiple physical
> machines (RAMster). Just not as flexible.
>
> And do you really want to rewrite the swap subsystem anyway
> when a handful of frontswap hooks do the same thing (and more)?
Not a plan to change the swap subsystem, I don't think it requires a
rewrite to just improve it. You are improving the swap subsystem
while adding frontswap. Not me. So I'd like the improvement to go in
the right direction.
If we add frontswap with tmem today, it shall be able tomorrow to
write the compressed data compacted on the swap device, without
requiring nuking frontswap. I mean incremental steps are totally fine,
it doesn't need to do it now, but it must be able to do it
later. Which means tmem must be somehow able to attach its memory to
bios, allocate swap entries with get_swap_page and write the tmem
memory there. I simply wouldn't like something that adds more work to
do later when we want swap to improve further.
> > I'm afraid adding frontswap in this form will still get stuck us in
> > the wrong model and most of it will have to be dropped and rewritten
> > to do just the above 3 points I described to do proper swap
> > compression.
>
> This is a red herring. I translate this as "your handful of hooks
> might interfere with some major effort that I've barely begun to
> design". And even if you DO code that major effort... the
> frontswap hooks are almost trivial and clearly separated from
> most of the core swap code... how do you know those hooks will
> interfere with your grand plan anyway?
Hey this is what I'm asking... I'm asking if these hooks will
interfere or not. If they're tailored for Xen or if they can make the
Linux Kernel VM better for the long run and we can go ahead and swap
the tmem memory to disk compacted later, or not. I guess everything is
possible but the simpler design the better. And I've no clue if this
is the simpler design.
> Do I have to quote Linus's statement from the KS2011 minutes
> again? :-)
Well don't worry it's not my decision if things go in or not, and I
tend to agree it's not huge work to remove frontswap later if needed,
but it is quite apparent you don't want to make changes at all, and
you need it to be merged in this form. Which makes me wonder if this
is because of hidden Xen ABI issues in tmem.c or similar Xen issues or
if it's just because you think the code should change later after it's
all upstream, including frontswap.
> IIUC, you're talking about improvements to host-swapping here.
> That is (IMHO) putting lipstick on a pig. And, in any case, you
> are talking about significant swap subsystm changes that only help
> a single user, KVM. You seem to be already measuring non-existent
A single user KVM? You've got to be kidding me.
The whole basis of the KVM design, and why I refuse Xen is: we never
improve KVM, we improve the kernel for non-virt usages! And by
improving the kernel for non-virt usages, we also improve KVM. KVM is
just like firefox or apache or anything that uses anonymous memory.
I never thought of KVM in the context of the changes to the swap
logic. Sure they'll improve KVM too if we do those, but that'd be the
side effect of having improved the desktop swap behavior in general.
We improve the kernel for non-virt usage to benchmark-beat Xen/vmware
etc... There's nothing I'm doing in the VM that improves only KVM
(even the mmu notifier are used by GRU and stuff, more recently a new
pci card doing remote DMA from AMD is using mmu notifier too,
infiniband could do that too).
In fact it can't actually recall a single kernel change I did over the
last few years that would improve only KVM :).
> KVM patches by a different/easier standard than you are applying
> to a simple frontswap patchset that's been public for nearly
> three years.
I'm perfectly fine if frontswap gets in.... as long as it is the way
to go for the future of the Linux VM. Ignore virt here please, no KVM,
no Xen (even no cgroups just in case they could matter). Just tmem and
bare metal.
> There is no tmem dependency on zcache. Feel free to rewrite
> zcache entirely. It still needs the hooks in the frontswap
> patch, or something at least very similar.
That I agree, the hooks probably would be similar.
> Then feel free to rewrite that code.. or wait until it gets
> fixed. I agree that it's unlikely that zcache will be promoted
> out of staging with that hack. That's all still unrelated to
> merging frontswap.
frontswap don't go in staging so the moment you add a dependency on
staging/zcache code from the core VM code, we've to look into what is
being called too... Otherwise again we get hook requests every year
from whatever new user that does something a little weird. Not saying
this is the case, but just reading the hooks and pretending they're
non-intrusive and quite similar to what would have to be done anyway,
isn't the convincing method. I will like it if I'm convinced that tmem
that is being called is the future way for the VM to handle
compression dynamically with direct control of the Linux VM (that is
needed or it can't shrink when mlockall program grows). And not some
Xen hack that can't be modified or Xen ABI breaks. You see there's a
whole lot of difference... Once it'll be proven that tmem is the
future way for the VM to go to do dynamic compression and compaction
of the data + writing it to disk when VM pressure increases, I don't
think anybody will argue on the frontswap hooks.
> Zcache doesn't need to pass through tmem.c. RAMster is using tmem.c
> but isn't even in staging yet.
That's what I had the feeling in fact, it looked like zcache could
work by its own without calling in tmem. But I guess tmem is still
needed to manage the memory pooling using by zcache?
> > The whole logic deciding the size of the frontswap zcache is going to
> > be messy.
>
> It's not messy, and is entirely dynamic. Finding the ideal
> heuristics for the maximum size, and when and how much to
> decompress pages back out of zcache back into the swap cache,
> I agree, is messy and will take some time.
That's what I intended to be messy... the heuristic to find the
maximum zcache size. And that requires feedback from the VM to shrink
fast if we're squeezed by mlocked RAM.
And yes it's better than zram without any doubt, there's no way to
squeeze zram out... :) But the tradeoff is you lose a fixed amount of
RAM with zram and overall it should help and it's non intrusive. It
doesn't require a magic heuristic to size it dynamically etc...
The major benefit of zcache should be:
1) dynamic sizing (but adding complexity)
2) ability later to compact the compressed memory and write it to disk
compacted when a shrink is requested by the VM pressure (and by the
core VM code)
> Still not sure how this is related to the proposed frontswap
> patch now (which just provides some mechanism for the heuristics
> to drive).
>
> > But to do the real swapout you should not pull the memory
> > out of frontswap zcache, you should write it to disk compacted and
> > compressed compared to how it was inserted in frontswap... That would
> > be the ideal.
>
> Agreed, that would be cool... and very difficult to implement.
Glad we agree :).
> Dynamic. Pulled out with frontswap_shrink, see above.
Got it now.
> I guess you are missing the key magic for RAMster, or really
> for tmem. Because everything in tmem is entirely dynamic (e.g.
> any attempt to put a page can be rejected), the "remote" machine
> has complete control over how many pages to accept from whom,
> and can manage its own needs as higher priority. Think of
> a machine in RAMster as a KVM/Xen "host" for a bunch of
> virtual-machines-that-are-really-physical-machines. And it
> is all peer-to-peer, so each machine can act as a host when
> necessary. None of this is possible through anything that
> exists today in the swap subsystem or blkio subsystem.
> And RAMster runs on the same cleancache and frontswap hooks
> as Xen and zcache and, potentially, KVM.
>
> Yeah, the heuristics may be even harder for RAMster. But
> the first response to this thread (from Christoph) said
> that this stuff isn't sexy. Personally I can't think of
> anything sexier than the first CROSS-MACHINE memory management
> subsystem in a mainstream OS. Again... NO additional core
> VM changes.
I see the point. And well this discussion certainly helps to clarify
this further (at least for me).
Another question is if you can stack these things on top of each
other. Like ramster over zcache. Because if that's possible you'd need
to write a backend to write out the tmem memory to disk and allowing
tmem to swap that way. And then you could also use ramster on
compressed pagecache. A system with little RAM could want compression
and if we're out of pagecache to share it through ramster but once we
have it compressed why not to send it compressed to other tmem in the
cloud?
> Is that answered clearly now?
Yep :).
> Think RAMster. Think a version of RAMster with a "memory server"
> (where the RAM expandability is in one server in a rack). Think
> fast SSDs that can be attached to one machine and shared by other
> machines. Think phase-change (or other future limited-write-cycle)
> RAM without a separate processor counting how many times a cell
> has been written. This WAS all about Xen a year or two ago.
> I haven't written a line of Xen in over a year because I am
> excited about the FULL value of tmem.
I understand this. I'd just like to know how much this is hackable, or
if the Xen dependency (that still remains) is a limitation for future
extension or development. I mean the value of the Xen part is zero in
my view, so if we add something like this it should be hackable and
free to evolve for the benefit of the core VM regardless of
whatever Xen API/ABI. That in short is my concern. No much
else... doesn't need to be perfect as long as it's hackable and there
is no resistence to fix things like below:
> Yep, let's fix that problem in zcache. That is a stupid
> coding error by me and irrelevant to frontswap and the bigger
> transcendent memory picture.
Ok! Glad to hear.
> Zram required exactly ONE change to the VM, and Nitin placed it
> there AFTER he looked at how frontswap worked. Then he was forced
> down the "gotta do it as a device" path which lost a lot of the
> value. Then, when he wanted to do compression on page cache, he
> found that the cleancache interface was perfect for it. Why
> does everyone keep telling me to "do it like zram" when the author
> of zram has seen the light? Did I mention Nitin's support for
> frontswap already? https://lkml.org/lkml/2011/10/28/8
>
> So, I repeat, where are we now? Have I sufficiently answered
> your concerns and questions? Or are you going to go start
> coding to prove me wrong with a swap subsystem rewrite? :-)
My argument about zram is currently frontswap+zcache provides very
little value addition to zram (considering you said before there's no
shrinker being called and whatever heuristic you're using today won't
be able to react in a timely fascion to a mlockall growing fast). So
if we add hooks on the core VM to depend on it, we need to be sure
it's hackable and allowed to improve without worrying about breaking
Xen later. I mean Xen may still work if modified for it, but there
shall be no such thing as an API or ABI that cannot be
broken. Otherwise again it's better you add a few Xen specific hacks
and then we evolve tmem separately from it.
And I still see the real value I would see from zcache+frontswap is if
we can add into zcache/tmem the code to compact the fragments and
write them into swap pages, kind of like tail packing in the
filesystem in fs/ (absolutely unrelated to blkdev layer).
If you confirm it's free to go and there's no ABI/API we get stuck
into, I'm fairly positive about it, it's clearly "alpha" feature
behavior (almost no improvement with zram today) but it could very
well be in the right direction and give huge benefit compared to zram
in the future. I definitely don't pretend things to be perfect... but
they must be in the right design direction for me to be sold off on
those. Just like KVM in virt space.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/