RE: [GIT PULL] mm: frontswap (for 3.2 window)

From: Dan Magenheimer
Date: Mon Oct 31 2011 - 16:59:15 EST

> From: Andrea Arcangeli [mailto:aarcange@xxxxxxxxxx]
> Subject: Re: [GIT PULL] mm: frontswap (for 3.2 window)

Hi Andrea --

Thanks for your input. It's good to have some real technical
discussion about the core of tmem. I hope you will
take the time to read and consider my reply,
and comment on any disagreements.

OK, let's go over your concerns about the "flawed API."

> 1) 4k page limit (no way to handle hugepages)

FALSE. The API/ABI was designed from the beginning to handle
different pagesizes. It can even dynamically handle more than
one page size, though a different "pool" must be created on
the kernel side for each different pagesize. (At the risk
of derision, remember I used to code for IA64 so I am
very familiar with different pagesizes.)

It is true that the current tmem _backends_ (Xen and
zcache) reject pagesizes other than 4K, but if there are
"frontends" that have a different pagesize, the API/ABI
supports it.

For hugepages, I agree copying 2M seems odd. But talking
about hugepages in the swap subsystem, I think we are
talking about a very remote future. (Remember cleancache
is _already_ merged so I'm limiting this to swap.) Perhaps
in that far future, Intel will have an optimized "copy2M"
instruction that can circumvent cache pollution?

> 2) synchronous

TRUE. (Well, mostly.... RAMster is exploiting some asynchrony
but that's all still experimental.)

Remember the whole point of tmem/cleancache/frontswap is in
environments where memory is scarce and CPU is plentiful,
which is increasingly common (especially in virtualization).
We all cut our teeth on kernel work in an environment where
saving every CPU cycle was important, but in these new
memory-constrained many-core environments, the majority of
CPU cycles are idle. So does it really matter if the CPU is
idle because it is waiting on the disk vs being used for
synchronous copying/compression/dedup? See the published
Xen benchmarks: CPU utilization goes up, but throughput
goes up too. Why? Because physical memory is being used
more efficiently.

Also IMHO the reason the frontswap hooks and the cleancache
hooks can be so simple and elegant and can support many
different users is because the API/ABI is synchronous.
If you change that, I think you will introduce all sorts
of special cases and races and bugs on both sides of the
ABI/API. And (IMHO) the end result is that most CPUs
are still mostly sitting idle waiting for work to do.

> 3) not zerocopy, requires one bounce buffer for every get and one
> bounce buffer again for every put (like highmem I/O with 32bit pci)

Hmmm... not sure I understand this one. It IS copy-based
so is not zerocopy; the page of data is actually moving out
of memory controlled/directly-addressable by the kernel into
memory that is not controlled/directly-addressable by the kernel.
But neither the Xen implementation nor the zcache implementation
uses any bounce buffers, even when compressing or dedup'ing.

So unless I misunderstand, this one is FALSE.

> 4) can't handle batched requests

TRUE. Tell me again why a vmexit/vmenter per 4K page is
"impossible"? Again you are assuming (1) the CPU had some
real work to do instead and (2) that vmexit/vmenter is horribly
slow. Even if vmexit/vmenter is thousands of cycles, it is still
orders of magnitude faster than a disk access. And vmexit/vmenter
is about the same order of magnitude as page copy, and much
faster than compression/decompression, both of which still
result in a nice win.

You are also assuming that frontswap puts/gets are highly
frequent. By definition they are not, because they are
replacing single-page disk reads/writes due to swapping.

That said, the API/ABI is very extensible, so if it were
proven that batching was sufficiently valuable, it could
be added later... but I don't see it as a showstopper.
Really do you?

> worse than HIGHMEM 32bit... Obviously you must be mlocking all Oracle
> db memory so you won't hit that bounce buffering ever with
> Oracle. Also note, historically there's nobody that hated bounce
> buffers more than Oracle (at least I remember the highmem issues with
> pci32 cards :). Also Oracle was the biggest user of hugetlbfs.

I already noted that there's no bounce buffers, but Oracle is
not pursuing this because of the Oracle _database_ (though
it does work on single node databases). While "Oracle" is
often used to equate to its eponymous database, tmem works
on lots of workloads and Oracle (even pre-Sun-merger) sells
tons of non-DB software. In fact I personally take some heat
for putting more emphasis on getting tmem into Linux than in
using it to proprietarily improve other Oracle products.

> If I'm wrong please correct me, I hadn't lots of time to check
> code. But we already raised these points before without much answer.

OK, so you're wrong on two of the points and I've corrected
you. On two of the points, synchrony and non-batchability,
you make claims that (1) these are bad and (2) that there
is a better way to achieve the same results with asynchrony
and batchability.

I do agree you've raised the points before, but I am pretty
sure I've always given the same answers, so you shouldn't
say that you haven't gotten "much answer" but that you disagree
with the answer you got.

I've got working code, it's going in real distros and products and
has growing usage by (non-Oracle) kernel developers as well as
real users clamoring for it or already using it. You claim
that by making it asynchronous it would be better, while I claim
that it would make it impossibly complicated. (We'd essentially
be rewriting, or creating a parallel, blkio subsystem.) You claim
that a batch interface is necessary, while I claim that if it is
proven that it is needed, it could be added later.

We've been talking about this since July 2009, right?
If you can do it better, where's your code? I have the
highest degree of respect for your abilities and I have no
doubt that you could do something similar for KVM over a
long weekend... but can you also make it work for Xen, for
in-kernel compression, and for cross-kernel clustering
(not to mention for other "users" in my queue)? The foundation
tmem code in the core kernel (frontswap and cleancache)
is elegant in its simplicity and _it works_.

REALLY no disrespect intended and I'm sorry if I am flaming,
so let me calm down by quoting Linus from the LWN KS2011

"[Linus] stated that, simply, code that actually is used is
code that is actually worth something... code aimed at
solving the same problem is just a vague idea that is
worthless by comparison... Even if it truly is crap,
we've had crap in the kernel before. The code does not
get better out of tree."

So, please, all the other parts necessary for tmem are
already in-tree, why all the resistance about frontswap?

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at