Re: Some questions about linux kernel.

From: Marco Colombo (marco@esi.it)
Date: Tue Mar 21 2000 - 06:22:20 EST


On Mon, 20 Mar 2000, Richard B. Johnson wrote:

> Malloc(), as stated before, just sets a new break address when it
> runs out of heap. It keeps track of the heap, but not very carefully.

[paging description snipped]

> A problem occurs when there are no longer any free pages to steal.

No. The problem we're discussing occurs when there's no more backing
storage on the swap device(s). If there aren't any "stealable" page-frames,
that's another matter. It means the whole of RAM is 'locked' by some means.
That's a deadlock situation, AFAIK, and SHOULD never happen. I think it
can happen only because of a kernel bug.

> When malloc() attempts to set a new break address, it sets up a
> handler. Then it calls the kernel to set a new break address.
> Malloc(), before accepting this address, could write a word of
> zeros to the top allocation. This could cause a page-fault. If
> the page-fault handler could not fault in a new page, it could
> send a signal to the process (received my malloc()). Malloc
> can then return NULL for the current allocation request. In this
> manner, the caller of malloc() would always be assured that memory
> was available.

It would have to touch all the pages. And, even if it's probably not so in
the current implementation, I wonder whether, once a page has been paged
back in, its slot on the swap device is made available to other pages or
not. It could be. In that case, you can't be sure even if you touch all
the pages at malloc() time.
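
Just to make the idea being discussed concrete, here is a minimal sketch
(my own, the function name is made up) of "touch every page right after
allocating it". It does force a page-fault on each page of the new range,
but, as said above, under an overcommitting kernel it still guarantees
nothing for the rest of the allocation's lifetime:

	#include <stdlib.h>
	#include <unistd.h>

	/*
	 * Sketch of the "touch every page" idea: write one byte per page
	 * so each page is faulted in right after allocation.  Under an
	 * overcommitting kernel this still does not reserve backing
	 * storage for the whole lifetime of the allocation.
	 */
	void *xmalloc_touch(size_t size)
	{
		size_t page = (size_t)sysconf(_SC_PAGESIZE);
		char *p = malloc(size);
		size_t i;

		if (p == NULL || size == 0)
			return p;
		for (i = 0; i < size; i += page)
			p[i] = 0;
		p[size - 1] = 0;	/* make sure the last page is touched too */
		return p;
	}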

> Unfortunately, this is naive. The first time the break address was
> extended, this would work. However, what happens after the kernel
> steals pages from your task to satisfy other requests? Eventually
> pages that you thought you owned, have to be faulted in. There may
> be no more pages to steal so you, thinking you have safely allocated
> real pages, are now deadlocked --and dead.

What do you mean by 'steal'? Do you mean paging out, or paging in (and
freeing swap space)? Page faults are not a problem here. The problem is
paging out when there's no more swap storage.

> The only solution to an out-of-memory condition is to never run
> out of memory. The place where all of the system information is
> known is in "user space". The kernel readily "knows" stuff about the
> current process, but retrieving information about other tasks in
> a page-fault handler would result in an extremely poor performing
> machine. A user-space daemon can acquire information about all the
> tasks, can detect runaway tasks, can safeguard special tasks like
> Web Servers that haven't gone crazy, and can watch for performance
> hurting rogue programs.
>
> Such a program, if properly designed, is the solution to such
> out-of-memory conditions.

User land or kernel land does not really matter. Such a program either
'overcommits' or it doesn't. If it does, it has to choose between deadlock
and kill. If it doesn't, the kernel could do that just as well.

>
> Cheers,
> Dick Johnson
>
> Penguin : Linux version 2.3.41 on an i686 machine (800.63 BogoMips).
>
>

[What follows is an answer to many other messages posted on the
matter, by many authors - basically my thoughts on the matter]

In the whole thread, people keep misusing the words "OOM", "memory",
"virtual" and so on. I agree that what you write is almost true. I'd like
to make it clear that "virtual RAM" is something that does not exist.
There's RAM, call it just "memory". There's swap space, which is used
to support paging (AFAIK, Linux does not do swapping). And there's
a process's virtual address space, which is represented as PTEs (and
other structures) in kernel land. Strictly speaking, you never go OOM
if you do paging. The kernel already lets processes grow bigger than
the available memory. During normal operation, if you see >0% swap
utilization, you're already "OOM" in that narrow sense. But it is well
known that a process does not need all the memory it allocates at once.
So, to increase system throughput (the number of processes you can run
in a given time interval) and concurrency (strictly related), you use
paging. You never go OOM here (infinite swap space assumed, of course),
but you do get problems if the amount of memory *needed* (not just
allocated) by all processes at a given time is bigger than the RAM you
have: the mm system just keeps paging in and out, reducing memory
accesses to I/O speed (and loading the I/O subsystem at the same time).
That's what *I* call OOM. You get there slowly, and you're really in
trouble when even the I/O subsystem can't keep up with it. The system
is stalled. The whole thread (and related ones) is NOT about this.

Since swap space is not infinite, you may fill it up, and then you're OOS
(out of swap). It's not OOM. The behaviour is completely different. With
just a few hundred KB free (on the swap device), the system is perfectly
fine. A few seconds later, it's completely stalled, in such an
unrecoverable way that (for many people) a process killer is the only
solution. What actions should be taken when OOS is detected is what
people are discussing now.

"Virtual memory" is a completely different, unrelated thing. The kernel
provides a private virtual address space to processes. Memory addresses
are "virtual", meaning that you write at address 0xff00, but only the
kernel knows where the process is writing in RAM. This also means that
you can run 10 copies of 'ls', each writing to addres 0xff00 but actually
writing to 10 different page-frames in RAM, the processes *not being aware*
of it: that's why its called "virtual". The whole thing is "virtual
address space" not "virtual memory". You could have paging WITHOUT addresses
translation, and processes could even handle that by themselves (think of
"overlays"). And you can have VA WITHOUT paging. Of course, virtual
addressing just integrates too well with a paging system, so we have both.
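
A tiny illustration of that point (my own example, not from the thread):
after fork(), parent and child hold the very same pointer value, yet
writes through it land on different page-frames, so each process sees
only its own data.

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/wait.h>
	#include <unistd.h>

	/*
	 * Same virtual address, different physical pages: after fork()
	 * the child writes through the same pointer value the parent
	 * holds, yet the parent's copy is left untouched.
	 */
	int main(void)
	{
		int *x = malloc(sizeof(*x));

		if (x == NULL)
			return 1;
		*x = 1;
		printf("virtual address in both processes: %p\n", (void *)x);
		if (fork() == 0) {		/* child */
			*x = 2;
			printf("child  sees %d\n", *x);
			exit(0);
		}
		wait(NULL);
		printf("parent sees %d\n", *x);	/* still 1 */
		return 0;
	}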

The whole thing about "malloc / overcommitting" arose because in legacy
UNIX implementations the kernel, *before* extending a process's VA space,
allocated enough backing storage on swap to *completely* swap it
out ('swap' it, not 'page' it!). This is required for 'swapping'.
It has been noticed that not all the processes will be swapped out at
the same time. That is, the kernel does not use all the swap space
it allocates at the same time. The same optimization you made for RAM,
you can make for swap. Again, you increase system throughput. With
paging, where only part of a process's address space is on swap at any
given time, it's even better.
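
A made-up example of the difference: 20 processes, each with a 50MB
address space but only 5MB of it actually dirty, would need 1GB of swap
reserved up front under the legacy pre-allocation scheme, while an
overcommitting kernel needs backing storage only for the roughly 100MB
that has really been touched.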

Both optimizations come at a price. Without paging, an OOM situation is
handled much better (in a way). And a process's address space is always
there (in RAM), as far as the process is concerned. With (just) swapping,
this is still true: either the process is completely swapped out (and so
'dead'), or its address space is completely valid. Yes, you can make many
examples of how useful this can be. For every low-latency application
that needs to react in a timely manner to external events, paging is a
bad thing. When its event loop calls the handle_extern_event_no_1090()
function (which has not been called recently, so it may have been paged
out), the application may sleep (completely unaware of it). Is paging
bad? (A sketch of the usual workaround, locking pages in RAM, follows
this paragraph.)
And, BTW, is it fair? My poor little process, using up just
100KB of address space, with a working set of 20KB, can't run smoothly
because of that big memory hog (a simulation) which has 800MB of address
space and a working set of about 62MB, and which is causing so much paging
in and out on our 64MB system? Shouldn't the paging system see that by
granting me just 2 more page-frames my performance would double, without
the big simulation even noticing it? You can make a worst-case example for
almost every kernel design or implementation choice. Shouldn't 'ls -l'
perform better if we put file ownership and permission info in the
directory entry, instead of in a separate structure?
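
For the low-latency case above, the standard (POSIX) escape hatch is to
pin the process's pages in RAM, so an idle event handler can't be paged
out behind its back. A minimal sketch (the wrapper name is mine; the call
normally requires appropriate privileges):

	#include <stdio.h>
	#include <sys/mman.h>

	/*
	 * Pin the whole address space (current and future mappings) in
	 * RAM so rarely-used handlers cannot be paged out.
	 */
	int pin_in_ram(void)
	{
		if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
			perror("mlockall");
			return -1;
		}
		return 0;
	}

Of course this just moves the problem: the locked pages are taken away
from everybody else, which is exactly the kind of special-case handling
advocated at the end of this mail.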

malloc() is just a user-space "memory" allocator. It's not a kernel
interface. Whatever structures it uses, or whatever side effects it has,
is beside the point. It uses brk() to extend a process's VA space when
needed, in order to handle more *addresses*, not more RAM. The kernel
does not grant anything but a set of virtual addresses.
In legacy OSes, as a *side effect* of the swap space pre-allocation,
it turns out that those addresses are always valid, either in RAM or
paged out. But I don't think this is implied by the brk() semantics.
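
To make the "addresses, not RAM" point concrete, a little sketch of mine
(not from any mail in the thread): on an overcommitting kernel sbrk() can
happily succeed for a large range, and nothing is backed until the pages
are actually written to.

	#include <stdio.h>
	#include <unistd.h>

	/*
	 * brk()/sbrk() grants virtual addresses, not page-frames.  On an
	 * overcommitting kernel this call can succeed even if RAM + swap
	 * could never back the whole range; a page gets backing storage
	 * only when it is first written.
	 */
	int main(void)
	{
		void *old = sbrk(256 * 1024 * 1024);	/* 256MB of addresses */

		if (old == (void *)-1) {
			perror("sbrk");
			return 1;
		}
		printf("break extended, old break was %p\n", old);
		/* no page touched yet: no RAM and no swap consumed */
		return 0;
	}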

Experience has already shown that in a general-purpose OS such as Linux,
swap space "overcommitting" is a win. In special cases, of course, it's
not. Providing special, worst-case examples of actual behaviour
won't make it disappear. Especially if the examples are life-support
systems or cruise missiles. Especially if you're using those examples to
compare Linux and NT based on stability.

In pre-2.2 times, I never saw a Linux box crash because of OOS. Nor
processes being randomly killed. The system just deadlocked. With X
running, sometimes I managed to "recover" from OOS just by hitting
CTRL-ALT-BACKSPACE, and *after a few hours* X exited, bringing the memory
hog down with it, and I had my system back. I'm not saying it was good.
It was a choice. An OOM (or better, OOS) killer is another one.

I think your swap space should be sized so that you hit the OOM
situation before an OOS one. And you do want to cure OOM, since
performance drops so badly in that case. So OOS should never happen.

Special cases do exist. Handle them in a special way.
Don't use an MMU on life-support systems (or missiles).
Have your big simulation program handle its own (huge) backing store.
Fix buggy software that hogs memory.
Disable malicious users' accounts, or set ridiculously low resource limits
for them. And don't run critical software on general-purpose systems.
Use RT OSes when needed.
Use resource limits and *limit the number of concurrent users* if you
want strict control over them (a small setrlimit() sketch follows below).
If availability is your only concern, triple your budget.
If you want to make the best use of available resources, just use Linux
(and its overcommitting mm system).
What's wrong with that?
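
As an illustration of the resource-limits point above (a sketch of mine,
not something from the thread): capping a process's data segment with
setrlimit() makes brk() fail early, so malloc() returns NULL to that user
instead of the whole system drifting toward OOS.

	#include <stdio.h>
	#include <sys/resource.h>

	/*
	 * Cap the data segment at 16MB: brk()/sbrk() growth beyond the
	 * limit fails, so malloc() returns NULL instead of the process
	 * silently eating all the swap.
	 */
	int limit_data_segment(void)
	{
		struct rlimit rl;

		rl.rlim_cur = 16 * 1024 * 1024;
		rl.rlim_max = 16 * 1024 * 1024;
		if (setrlimit(RLIMIT_DATA, &rl) != 0) {
			perror("setrlimit");
			return -1;
		}
		return 0;
	}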

.TM.

-- 
      ____/  ____/   /
     /      /       /			Marco Colombo
    ___/  ___  /   /		      Technical Manager
   /          /   /			 ESI s.r.l.
 _____/ _____/  _/		       Colombo@ESI.it



