Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

From: Andy Nelson
Date: Thu Nov 03 2005 - 20:01:48 EST






Linus writes:

>Just face it - people who want memory hotplug had better know that
>beforehand (and let's be honest - in practice it's only going to work in
>virtualized environments or in environments where you can insert the new
>bank of memory and copy it over and remove the old one with hw support).
>
>Same as hugetlb.
>
>Nobody sane _cares_. Nobody sane is asking for these things. Only people
>with special needs are asking for it, and they know their needs.


Hello, my name is Andy. I am insane. I am one of the CRAZY PEOPLE you wrote
about. I am the whisperer in people's minds, causing them to conspire
against sanity everywhere and make their lives as insane and crazy as mine.
I love my work. I am an astrophysicist. I have lurked on various Linux
lists for years now, and this is my first time standing up in front of all
you people, hoping to make you bend your insane and crazy kernel-developing
minds to listen to the rantings of my insane and crazy HPC mind.

I have done high performance computing in astrophysics for nearly two
decades now. It gives me a perspective that kernel developers usually
don't have, but sometimes need. For my part, I promise that I specifically
do *not* have the perspective of a kernel developer. I don't even speak C.

I don't really know what you folks do all day or night, and I actually
don't much care except when it impacts my own work. I am fairly certain
that a lot of this hotplug/page defragmentation/page faulting/page zeroing
stuff that the SGI and IBM folks are currently getting rejected from
inclusion in the kernel impacts my work in very serious ways. You're
right, I do know my needs. They are not being met, and the people with the
power to do anything about it call me insane and crazy and refuse even to
be interested in making improvement possible, even when it would quite
likely help them too.

Today I didn't hear a voice in my head telling me to shoot the pope, but
I did hear one telling me to write a note telling you about my issues,
which apparently belong to the 0.01% of insane crazies that should be
ignored, as do about half of the people responding on this thread.
I'll tell you a bit about my issues and their context, now that things
have gotten hot enough that even a devout lurker like me is posting. Some
of it might make sense. Other parts may turn out to be internally
inconsistent, if only I knew enough to see it. Still other parts may be
useful for getting people who don't talk to each other in contact, and
thinking about things in ways they haven't.


I run large hydrodynamic simulations using a variety of techniques
whose relevance is only tangential to the current flamefest. I'll bring in
the important details as they become relevant later. A lot of my statements
will be common to a large fraction of all HPC applications, and I imagine
to many large-scale database applications as well, though I'm guessing a
bit there.

I run the same codes on many kinds of systems, from workstations up
to large supercomputing platforms. Mostly my experience has been
with shared-memory systems, but recently I've been part of things that
will put me into distributed-memory space as well.

What does it mean to use computers like I do? Maybe this is surprising,
but my executables are very, very small: typically 1-2 MB or less, with
only a bit more needed for various external libraries like FFTW or
the like. On the other hand, my memory requirements are huge: typically
many GB, and some folks run simulations with many TB. Picture a very
small and very fast flea repeatedly jumping around all over the skin of a
very large elephant, taking a bite at each jump, and that is a crude idea
of what is happening.


This has bearing on the current discussion in the following ways, which
are not theoretical in any way.


1) Some of these simulations frequently need to access data that is
located very far away in memory. That means that the bigger your
pages are, the fewer TLB misses you get, the smaller the
thrashing, and the faster your code runs.

One example: I have a particle hydrodynamics code that uses gravity.
Molecular dynamics simulations have similar issues with long-range
forces too. Gravity is calculated by culling acceptable nodes and atoms
out of a tree structure that can be many GB in size or, for bigger
jobs, many TB. You have to traverse the entire tree for every particle
(or closely packed small group). During this stage, almost every node
examination (a simple compare and about 5 flops) requires at least one
TLB miss and, depending on how you've laid out your arrays, several TLB
misses. Huge pages help this problem, big time. It would be fine with me
if all I had was one single page. If I am stupid and get into swap
territory, I deserve every bad thing that happens to me.

Now you have a list of a few thousand nodes and atoms with their data
spread sparsely over that entire multi-GB memory volume. Grab the data
(about 10 double precision numbers) for one node, do 40-50 flops with
it, and repeat, L1- and TLB-thrashing your way through the entire list.
There are some tricks that work sometimes (preload an L1-sized array
of node data and use it for an entire group of particles, then discard
it for another preload if there is more data; dimension arrays in the
right direction so you get multiple loads from the same cache line;
etc.), but such things don't always work or aren't always useful.
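
To make that access pattern concrete, here is a stripped-down sketch of
the kind of loop I mean. The subroutine name, the 10-doubles-per-node
layout and the monopole-only force are invented for illustration; my
real code is messier.

! Schematic of the per-particle interaction loop described above. The
! node indices in "list" come out of the tree walk and are scattered
! across a node array that may be many GB in size, so nearly every
! iteration lands on a different page: that is where the TLB misses
! come from. (Gravitational softening omitted for brevity.)
subroutine accumulate_gravity(nlist, list, nodes, x, y, z, ax, ay, az)
  implicit none
  integer, intent(in)           :: nlist
  integer, intent(in)           :: list(nlist)   ! culled node indices
  double precision, intent(in)  :: nodes(10,*)   ! ~10 doubles per node: mass, position, moments
  double precision, intent(in)  :: x, y, z       ! particle (or group) position
  double precision, intent(inout) :: ax, ay, az  ! accumulated acceleration
  integer :: i, j
  double precision :: dx, dy, dz, r2, rinv, rinv3

  do i = 1, nlist
     j  = list(i)                       ! effectively random stride through memory
     dx = nodes(2,j) - x                ! one sparse gather of ~10 doubles ...
     dy = nodes(3,j) - y
     dz = nodes(4,j) - z
     r2 = dx*dx + dy*dy + dz*dz
     rinv  = 1.0d0/sqrt(r2)
     rinv3 = nodes(1,j)*rinv*rinv*rinv  ! ... followed by a few tens of flops
     ax = ax + dx*rinv3
     ay = ay + dy*rinv3
     az = az + dz*rinv3
  end do
end subroutine accumulate_gravity

With small pages, each pass through that loop costs one or more TLB
misses for a handful of flops; with 16 MB pages, a large chunk of the
node array sits behind a single TLB entry.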

I can easily imagine database apps doing things not too dissimilar to
this. With my particular code, I have measured factors of several (\sim
3-4) speedup with large pages compared to small. This was measured on
an Origin 3000, where 64 KB, 1 MB and 16 MB pages were used. Not a factor
of several percent. A factor of several. I have also measured similar
sorts of speedups on other types of machines. It is also not an effect
related to NUMA: I can see the effects from that source and can
distinguish between them.

Another example: take a code that discretizes space on a grid
in 3D and does something to various variables to make them evolve.
You've got 3D arrays many GB in size, and for various calculations
you have to sweep through them in each direction: x, y and z. Going
in the z direction means that you are leaping across huge slices of
memory every time you increment the grid zone by 1. In some codes
only a few calculations are needed per zone. For example, you want
to take a derivative:

deriv = (rho(i,j,k+1) - rho(i,j,k-1))/dz(k)

(I speak Fortran, so the last index is the slow one here).

Again, every calculation strides through huge distances and gets you
a TLB miss or several. Note for the unwary: it usually does not make
sense to transpose the arrays so that the fast index is the one you
work with. You don't have enough memory, for one thing, and you pay
the TLB overhead in the transpose anyway.
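
Spelled out as a loop, a z-direction pass looks something like the
sketch below (the subroutine, the bounds and the 1024x1024 plane size
are invented for illustration). Directional sweeps typically want to
walk along z one pencil at a time, and each step along the pencil jumps
an entire x-y plane through memory.

! Schematic z-direction sweep over a 3D array. In Fortran, rho(i,j,k)
! is stored with i fastest, so each step in k jumps nx*ny elements:
! for a 1024x1024 plane of double precision data that is 8 MB, and
! every step along the pencil lands on a different small page.
subroutine z_derivative(nx, ny, nz, rho, dz, deriv)
  implicit none
  integer, intent(in)           :: nx, ny, nz
  double precision, intent(in)  :: rho(nx,ny,nz), dz(nz)
  double precision, intent(out) :: deriv(nx,ny,nz)
  integer :: i, j, k

  do j = 1, ny
     do i = 1, nx
        do k = 2, nz-1        ! innermost loop walks along z, one pencil at a time
           deriv(i,j,k) = (rho(i,j,k+1) - rho(i,j,k-1))/dz(k)
        end do
     end do
  end do
end subroutine z_derivative

The neighbouring (i,j) pencils touch the same set of pages again, so
the more memory each TLB entry covers, the more of that reuse the TLB
can actually hold onto.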

In both examples, with large pages the chances of getting a TLB hit
are far, far higher than with small pages. That means I want truly
huge pages. Assuming pages at all (various arches don't have them,
I think), a single one that covered my whole memory would be fine.
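
To put rough numbers on that (the TLB size here is a made-up but
plausible figure, not the spec of any particular machine): with a
128-entry TLB,

128 entries x 16 KB pages = 2 MB of address space covered at once
128 entries x 16 MB pages = 2 GB of address space covered at once

Against a working set of many GB, the first case means a sparse access
pattern misses the TLB almost every time, while the second covers a
thousand times more memory with exactly the same hardware.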


Other codes don't seem to benefit so much from large pages, or even
benefit from small pages, though my experience is minimal with
such codes. Other folks run them on the same machines I do, though:


2) The last paragraph above is important because of the way HPC
works as an industry. We often don't just have a dedicated machine to
run on that gets booted once and runs one dedicated application until
it dies or gets rebooted again. Many jobs run on the same machine.
Some jobs run for weeks. Others run for a few hours over and over
again. Some run massively parallel. Some run as throughput jobs.

How is this situation handled? With a batch scheduler. You submit
a job and ask for X CPUs, Y memory and Z time. The scheduler goes and
fits you in wherever it can. cpusets were helpful infrastructure
in Linux for this.

You may get some CPUs on one side of the machine, some more
on the other, and memory associated with still others. Schedulers
do a pretty good job of allocating resources sanely, but there is
only so much they can do.

The important point here for page-related discussions is that
someone, you don't know who, was running on those CPUs and that memory
before you. And doing Ghu Knows What with it.

That code could have been something that benefits from small pages, or
it could have been running with large pages. It could have been
dynamically allocating and freeing large or small blocks of memory, or
it could have been allocating everything at the beginning and running
statically thereafter. Different codes do different things. That means
the memory state could be totally fubar'ed before your job ever gets
any time allocated to it.

>Nobody takes a random machine and says "ok, we'll now put our most
>performance-critical database on this machine, and oh, btw, you can't
>reboot it and tune for it beforehand".

Wanna bet?

What I wrote above makes tuning the machine itself totally ineffective.
What do you tune for? Tuning for one person's code makes someone else's
slower. Tuning for the same code on one input makes another input run
horribly.

You also can't be rebooting after every job. What about all the other
ones that weren't done yet? You'd piss off everyone running there and
it takes too long besides.

What about a machine that is running multiple instances of some
database, some bigger or smaller than others, or doing other kinds
of work? Do you penalize the big ones or the small ones, this kind
of work or that?


You also can't establish zones that can't be changed on the fly
as things on the system change. How do zones like that fit into
NUMA? How do things work when suddenly you've got a job that wants
the entire memory filled with large pages and you've only got
half your system set up for large pages? What if you tune the
system that way and let that job run, and then for some stupid user
reason it dies 10 minutes after starting? Do you let the 30
other jobs in the queue sit idle because they want a different
page distribution?

This way lies madness. Sysadmins just say no and set up the machine
as stably as they can, usually with something not too different
from whatever the manufacturer recommends as a default. For very good
reasons.

I would bet the only kind of zone stuff that could even possibly
work would be related to a cpu/memset zone arrangement. See below.


3) I have experimented quite a bit with the page merge infrastructure
that exists on IRIX. I understand that similar large-page and merge
infrastructure exists on Solaris, though I haven't run on such systems.
I can get very good page distributions if I run immediately after a
reboot. I get progressively worse distributions if my job runs even
a few days or weeks later.

My experience is that after some days or weeks of running have gone
by, there is no possible way, short of a reboot, to get pages merged
effectively back to any pristine state with the infrastructure that
exists there.

Some improvement can be had, however, with a bit of pain. What I
would like to see is not a theoretical, general, all-purpose
defragmentation and hotplug scheme, but one that can work effectively
with the kinds of constraints that a batch scheduler imposes.
I would even imagine that a more general scheduling situation
could be handled effectively if that scheduler were smart enough. God
knows, the scheduler in Linux has been rewritten often enough. What is
one more time, for this purpose too?

You may claim that this sort of merge stuff requires excessive time
from the OS. Nothing could matter to me less. I've got those CPUs
full time for the next X days, and if I want them to spend the first
5 minutes or whatever of my run making the place comfortable, so that
my job gets done three days earlier, then I want to spend that time.


4) The thing is that all of this memory management at this level is not
the batch scheduler's job, it's the OS's job. The thing that will make
it work is that, in the case of a reasonably intelligent batch scheduler
(there are many), you are absolutely certain that nothing else is
running on those CPUs and that memory. Except whatever the kernel
sprinkled in and didn't clean up afterwards.

So why can't the kernel clean up after itself? Why does the kernel need
to keep anything in this memory anyway? I supposedly have a guarantee
that it is mine, but it goes and immediately violates that guarantee
long before I even get started. I want all that kernel stuff gone from
my allocation and reset to a nice, sane pristine state.

The thing that would make all of it work is good fragmentation handling
and hotplug-type stuff in the kernel. Push everything that the kernel
did to my memory into the bitbucket and start over. There shouldn't be
anything there that it needs to remember from before anyway. Perhaps
this is what the defragmentation stuff is supposed to help with.
Probably it has other uses that aren't on my agenda, like pulling out
bad RAM sticks or whatever. Perhaps there are things that need to be
remembered. Certainly being able to hot-unplug those pieces would do it:
just do everything but unplug them from the board, and then do a hotplug
to turn them back on.


5) You seem to claim that the issues I wrote about above are 'theoretical
general cases'. They are not, at least not to a group any smaller than the
0.01% of people who regularly time their kernel builds, as I saw someone
doing some emails ago. Using that sort of argument as a reason not to
incorporate this sort of infrastructure just about made me fall out of
my chair, especially in the context of keeping the sane case sane.
Since this thread has long since lost decency and meaning and descended
into name-calling, I suppose I'll pitch in with that too, on two fronts:

1) I'd say someone making that sort of argument is doing some very serious
navel gazing.

2) Here's a cluebat: that ain't one of the sane cases you wrote about.



That said, it appears to me there are a variety of constituencies that
have some serious interest in this infrastructure:

1) HPC stuff.

2) Big database stuff.

3) People who are pushing hotplug for other reasons, like the
bad-memory replacement stuff I saw discussed.

4) Whatever else the hotplug folks want that I don't follow.


Seems to me that is a bit more than 0.01%.



>When you hear voices in your head that tell you to shoot the pope, do you
>do what they say? Same thing goes for customers and managers. They are the
>crazy voices in your head, and you need to set them right, not just
>blindly do what they ask for.

I don't care if you do what I ask for, but I do start getting irate and
start writing long annoyed letters if I can't do what I need to do, and
find out that someone could do something about it but refuses.

That said, I'm not so hot any more so I'll just unplug now.


Andy Nelson



PS: I read these lists at an archive, so if responders want to rm me from
any cc's that is fine. I'll still read what I want or need to from there.



--
Andy Nelson Theoretical Astrophysics Division (T-6)
andy dot nelson at lanl dot gov Los Alamos National Laboratory
http://www.phys.lsu.edu/~andy Los Alamos, NM 87545










