RE: [GIT PULL] mm: frontswap (for 3.2 window)

From: Dan Magenheimer
Date: Tue Nov 01 2011 - 14:12:17 EST


> From: James Bottomley [mailto:James.Bottomley@xxxxxxxxxxxxxxxxxxxxx]
> Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)
>
> On Mon, 2011-10-31 at 08:39 -0700, Dan Magenheimer wrote:
> > > From: James Bottomley [mailto:James.Bottomley@xxxxxxxxxxxxxxxxxxxxx]
> > > Subject: RE: [GIT PULL] mm: frontswap (for 3.2 window)
> >
> > > On Fri, 2011-10-28 at 13:19 -0700, Dan Magenheimer wrote:
> > > > For those who "hack on the VM", I can't imagine why the handful
> > > > of lines in the swap subsystem, which is probably the most stable
> > > > and barely touched subsystem in Linux or any OS on the planet,
> > > > is going to be a burden or much of a cost.
> > >
> > > Saying things like this doesn't encourage anyone to trust you. The
> > > whole of the MM is a complex, highly interacting system. The recent
> > > issues we've had with kswapd and the shrinker code gives a nice
> > > demonstration of this ... and that was caused by well tested code
> > > updates.
> >
> > I do understand that. My point was that the hooks are
> > placed _statically_ in largely stable code so it's not
> > going to constantly get in the way of VM developers
> > adding new features and fixing bugs, particularly
> > any developers that don't care about whether frontswap
> > works or not. I do think that is a very relevant
> > point about maintenance... do you disagree?
>
> Well, as I've said, all the mm code is highly interacting, so I don't
> really see it as "stable" in the way you suggest. What I'm saying is
> that you need to test a variety of workloads to demonstrate there aren't
> any nasty interactions.

I guess I don't understand how there can be any interactions
at all, let alone _nasty_ interactions when there is no
code to interact with?

For clarity and brevity, let's call the three cases:

Case A) CONFIG_FRONTSWAP=n
Case B) CONFIG_FRONTSWAP=y and no tmem backend registers
Case C) CONFIG_FRONTSWAP=y and a tmem backend DOES register

There are no interactions in Case A, agreed? I'm not sure
if it is clear, but in Case B every hook checks to
see if a tmem backend is registered... if not, the
hook is a no-op except for the addition of a
compare-pointer-against-NULL op, so there is no
interaction there either.
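
As a rough illustration of Cases A through C above, here is a
minimal userspace sketch in C. It is not the frontswap patch
itself: every name prefixed with sketch_ is hypothetical, the
compile-time switch stands in for CONFIG_FRONTSWAP, and a return
value of -1 is assumed to mean "fall through to the normal
swap-to-disk path".

```c
/*
 * Userspace sketch of the three cases. NOT the real frontswap code:
 * all sketch_* names and the -1 "use the disk" convention are
 * assumptions made for illustration only.
 */
#include <assert.h>
#include <stddef.h>

#define SKETCH_CONFIG_FRONTSWAP 1  /* set to 0 to model Case A */

/* Operations a hypothetical tmem backend would supply. */
struct sketch_backend_ops {
    int (*put_page)(unsigned long offset, const void *page);
};

#if SKETCH_CONFIG_FRONTSWAP

/* NULL until some backend registers (Case B while it stays NULL). */
static const struct sketch_backend_ops *sketch_backend;

static void sketch_register_backend(const struct sketch_backend_ops *ops)
{
    assert(ops != NULL && ops->put_page != NULL);
    sketch_backend = ops;
}

/* Returns 0 if the backend took the page, -1 if the caller must swap to disk. */
static int sketch_hook_put_page(unsigned long offset, const void *page)
{
    if (sketch_backend == NULL)  /* Case B: one pointer compare, then no-op */
        return -1;
    return sketch_backend->put_page(offset, page);  /* Case C */
}

#else

/* Case A: the hook compiles down to a constant, so no code runs at all. */
static inline int sketch_hook_put_page(unsigned long offset, const void *page)
{
    (void)offset;
    (void)page;
    return -1;
}

#endif

/* Hypothetical backend that accepts every page and counts how many it saw. */
static unsigned long sketch_pages_accepted;

static int sketch_accept_all(unsigned long offset, const void *page)
{
    (void)offset;
    (void)page;
    sketch_pages_accepted++;
    return 0;
}

static const struct sketch_backend_ops sketch_accepting_backend = {
    .put_page = sketch_accept_all,
};
```

In this sketch, calling sketch_hook_put_page() before any
registration takes the Case B branch and returns -1; after
sketch_register_backend(&sketch_accepting_backend) the same call
is forwarded to the backend, which is the only path where
interactions with the rest of the system can arise.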

So the only case where interactions are possible is
Case C, which currently can occur only if a user
specifies a kernel boot parameter of "tmem" or "zcache".
(I know, a bit ugly, but there's a reason for doing
it this way, at least for now.)

> > Runtime interactions can only occur if the code is
> > config'ed and, if config'ed, only if a tmem backend (e.g.
> > Xen or zcache) enables it also at runtime.
>
> So this, I don't accept without proof ... that's what we initially said
> about the last set of shrinker updates that caused kswapd to hang
> sandybridge systems ...

This makes me think that you didn't understand the
code underlying Case B above, true?

> > When
> > both are enabled, runtime interactions do occur
> > and absolutely must be fully tested. My point was
> > that any _users_ who don't care about whether frontswap
> > works or not don't need to have any concerns about
> > VM system runtime interactions. I think this is also
> > a very relevant point about maintenance... do you
> > disagree?
>
> I'm sorry, what point about maintenance?

The point is that only Case C has possible interactions
so Case A and Case B end-users and kernel developers need
not worry about the maintenance.

IOW, if Johannes merges some super major swap subsystem rewrite
and he doesn't have a clue if/how to move the frontswap
hooks, his patch doesn't affect any Case A or Case B users
and not even any Case C users that aren't using the latest upstream.

That seems relevant to me when we are discussing
how much maintenance cost frontswap requires which,
I think, was where this subthread started several
emails ago :-)

> > > You can't hand wave away the need for benchmarks and
> > > performance tests.
> >
> > I'm not. Conclusive benchmarks are available for one user
> > (Xen) but not (yet) for other users. I've already acknowledged
> > the feedback desiring benchmarking for zcache, but zcache
> > is already merged (albeit in staging), and Xen tmem
> > is already merged in both Linux and the Xen hypervisor,
> > and cleancache (the alter ego of frontswap) is already
> > merged.
>
> The test results for Xen I've seen are simply that "we're faster than
> swapping to disk, and we can be even better if you use self ballooning".
> There's no indication (at least in the Xen Summit presentation) what the
> actual workloads were.
>
> > So the question is not whether benchmarks are waived,
> > but whether one accepts (1) conclusive benchmarks for Xen;
> > PLUS (2) insufficiently benchmarked zcache; PLUS (3) at
> > least two other interesting-but-not-yet-benchmarkable users;
> > as sufficient for adding this small set of hooks into
> > swap code.
>
> That's the point: even for Xen, the benchmarks aren't "conclusive".
> There may be a workload for which transcendent memory works better, but
> make -j8 isn't enough of a variety of workloads.

OK, you got me, I guess "conclusive" is too strong a word.
It would be more accurate to say that the improvement predicted
in theory, which some people were very skeptical about,
measured even better than expected.

I agree that one workload isn't enough... I can assure you that
there have been others. But I really don't think you are asking
for more _positive_ data; you are asking if there is _negative_
data. As you point out, "we're faster than swapping to disk" is
not a hard bar to clear. IOW, comparing any workload that swaps
a lot against the same workload swapping a lot less doesn't
really prove anything. OR DOES IT? Considering that reducing swapping
is the WHOLE POINT of frontswap, I would argue that it does.

Can we agree that if frontswap is doing its job properly on
any "normal" workload that is swapping, it is improving on a
bad situation?

Then let's get back to your implied question about _negative_
data. As described above there is NO impact for Case A
and Case B. (The zealot will point out that a
pointer-compare-against-NULL per page swapped in/out is not
"NO" impact, but let's ignore him for now.) In Case C, there are
demonstrated benefits for SOME workloads... will frontswap
HARM some workloads?

I have openly admitted that for _cleancache_ on _zcache_,
sometimes the cost can exceed the benefits, and this was
actually demonstrated by one user on lkml. For _frontswap_
it's really hard to imagine even a very contrived workload
where frontswap fails to provide an advantage. I suppose
maybe if your swap disk lives on a PCI SSD and your CPU
is an ancient single-core that does extremely slow copying
and compression?

IOW, I feel like you are giving me busywork, and any additional
evidence I present you will wave away anyway.

> > I understand that some kernel developers (mostly from one
> > company) continue to completely discount Xen, and
> > thus won't even look at the Xen results. IMHO
> > that is mudslinging.
>
> OK, so let's look at this another way: one of the signs of a good ABI is
> generic applicability. Any good virtualisation ABI should thus work for
> all virtualisation systems (including VMware should they choose to take
> advantage of it). The fact that transcendent memory only seems to work
> well for Xen is a red flag in this regard.

I think the tmem ABI will work fine with any virtualization system,
and particularly frontswap will. There are some theoretical arguments
that KVM will get little or no benefit, but those arguments
pertain primarily to cleancache. And I've noted that the ABI
was designed to be very extensible, so if KVM wants a batching
interface, they can add one. To repeat from the LWN KS2011 report:

"[Linus] stated that, simply, code that actually is used is
code that is actually worth something... code aimed at
solving the same problem is just a vague idea that is
worthless by comparison... Even if it truly is crap,
we've had crap in the kernel before. The code does not
get better out of tree."

AND the API/ABI clearly supports other non-virtualization uses
as well. The in-kernel hooks are very simple and the layering
is very clean. The ABI is extensible, has been published for
nearly three years, and successfully rev'ed once (to accommodate
192-bit exportfs handles for cleancache). Your arguments are on
very thin ice here.

It sounds like you are saying that unless/until KVM has a completed
measurable implementation... and maybe VMware and Hyper-V as well...
you don't think the tiny set of hooks that are frontswap should
be merged. If so, that "red flag" sounds self-serving, not what I
would expect from someone like you. Sorry.

> So what I don't like about this style of argument is the sleight of
> hand: I would expect the inactive but configured case to show mostly in
> the shrinker paths, which is where our major problems have been, so that
> would be cleancache, not frontswap, wouldn't it?

Yes, this is cleancache (already merged). As described
above, frontswap executes no code in Case A and only a
NULL-pointer check in Case B, so it can't possibly interact
with the shrinker path.

> > So the remaining question is the performance impact when
> > compile-time AND runtime enabled; this is in the published
> > Xen presentation I've referenced -- the impact is much much
> > less than the performance gain. IMHO benchmark results can
> > be easily manipulated so I prefer to discuss the theoretical
> > underpinnings which, in short, is that just about anything
> > a tmem backend does (hypercall, compression, deduplication,
> > even moving data across a fast network) is a helluva lot
> > faster than swapping a page to disk.
> >
> > Are there corner cases and probably even real workloads
> > where the cost exceeds the benefits? Probably... though
> > less likely for frontswap than for cleancache because ONLY
> > pages that would actually be swapped out/in use frontswap.
> >
> > But I have never suggested that every kernel should always
> > unconditionally compile-time-enable and run-time-enable
> > frontswap... simply that it should be in-tree so those
> > who wish to enable it are able to enable it.
>
> In practise, most useful ABIs end up being compiled in ... and useful
> basically means useful to any constituency, however small. If your ABI
> is useless, then fine, we don't have to worry about the configured but
> inactive case (but then again, we wouldn't have to worry about the ABI
> at all). If it has a use, then kernels will end up shipping with it
> configured in which is why the inactive performance impact is so
> important to quantify.

So do you now understand/agree that the inactive performance impact
is zero and the interaction of an inactive configuration with the
remainder of the MM subsystem is zero? And that you and your users will be
completely unaffected unless you/they intentionally turn it on,
not only compiled in, but explicitly at runtime as well?

So... understanding your preference for more workloads and your
preference that KVM should be demonstrated as a profitable user
first... is there anything else that you think should stand
in the way of merging frontswap so that existing and planned
kernel developers can build on top of it in-tree?

Thanks,
Dan