Re: Some questions about linux kernel.

From: Jesse Pollard (pollard@tomcat.admin.navo.hpc.mil)
Date: Thu Mar 23 2000 - 09:04:13 EST


Marco Colombo <marco@esi.it>:
> [...]
> > Good summation and review..
>
> thankx. B-)
>
> [little process vs. memory-hog snipped]
>
> > Note: both processes were allowed to run. There were sufficient real resources
> > available. If there was enough for the simulation, and you started your
> > process, which process would have been aborted? the simulation, or yours?
> > Which was deemed more importatant for the organization at the time?
>
> I agree. But i was not considering quotas. Just plain paging. Little
> process "A" needs 5 pages to run. Big process "B" needs, say 20000.
> There are only 16000 page-frames on the system. If process "A" does not
> spin on all its 5 pages, is possible the the kernel pages-out one of
> those 5 pages. This is not fair. B is the hog. A being left alone, will run
> happily without ever page-faulting. And B won't notice the difference.
> You increase system throughput. The scheduler applies the same fairness
> principle when allocating the CPU.
>
> The whole point is that paging itself can be said unfair. Every kernel
> routine is bad, if you build up a worst case example. Not a good reason
> to change that.
>
> > > malloc() is just a user space "memory" allocator. It's not a kernel
> > > interface. What ever structures it uses, of what side effects you have
> > > is out of question. It uses brk() to extend a process VA space when
> > > needed, in order to handle more *addresses*, not more RAM. The kernel
> > > does not grant anything but a set virtual addresses.
> > > In legacy OSes, as a *side effect* of the swap space pre-allocation,
> > > it turns out that those addresses are always valid, either in RAM or
> > > paged out. But I don't think this is implied in the brk() semantic.
> > >
> > > Experience has already shown that in a general purpose OS such as Linux
> > > swap space "overcommitting" is a win. On special cases, of course, it's
> > > not. Provinding special, worst case examples for actual behaviour
> > > won't make it disappear. Expecially if the examples are life support systems
> > > or cruise missiles. Expecially if you're using those example to compare
> > > Linux and NT based on stability.
> >
> > It is a win in a single user environment. It is a catastrophe waiting to
> > happen in a multiuser environment. I need to prevent the failures.
>
> No. It's a win the more active processes you have. It increases system
> throughput.

It depends on the job mix - If there is one process that requires the
entire CPU (and 70% resident memory, no swap). The throuput goes down with
the additional 1 process (not much, just the context switch, cache reload...).
If the additional process needs 40% of memory then both processes loose -
context switch, cache reload, and paging.

Now the I/O overlap does provide some CPU for another process. In the past -
(my expierence) it is best to not have more that 2 or 3 active processes
per CPU. This depends on the time delay the I/O subsystem takes. If it is
a poor I/O system, then more active processes can be there. This starts
taking a toll when the swap activity also increases (processes are alread
in an I/O wait...). At this point, additional processes will not improve
throughput.

> > > In pre 2.2 times, I've never seen a Linux box crash for OOS. Nor processes
> > > being randomly killed. The system just deadlocked. With X running, sometimes
> > > I managed to "recover" OOS just hitting CRTL-ALT-BACKSPACE and *after a
> > > few hours*, X exited bringing the memory hog with it, and i had my system
> > > back. I'm not saying it was good. It was a choice. An OOM (or better OOS)
> > > killer is another one.
> >
> > I have, at least I believe thats what happend. I couldn't be there running
> > top at the instant it failed (it rebooted itself).
> >
> > > I think your swap space should be dimensioned in a way that you hit
> > > the OOM situation before a OOS one. And you just want to cure OOM, the
> > > performaces dropping so much in such a case. So OOS should never happen.
> > >
> > > Special cases do exist. Handle them in a special way.
> > > Don't use MMU on life-support systems (or missiles).
> > > Have your big simulation program handle its own (huge) backing store.
> >
> > How is that enforced?
>
> Save computed data to disk and reuse memory. Memory should be locked in
> RAM. Beware of stack grows. And most of all run on dedicated hardware.

That reduces throughput, and make a remendous amount of idle capacity. If every
job had its' own computer system, tuned to just that job, then you get
the best throughput, but the worst economic use. That was why multi-user
systems were created.

> > How does that simplify the difficulty of programming it in the first place?
>
> Big & critical apps already have complex programming.
> Database systems bypass most of the FS/buffercache features.
> You can't ask to remove the buffer cache just to allow you to easily
> write your database server.
>
> > > Fix buggy software that memory hogs.
> > > Disable malicious users's accounts, or set ridicusly low resurce limits for
> > > them. And don't run critical software on general purpose systems.
> >
> > Weather forcasting is considered critical. Yet they are run on general purpose
> > systems because there is no such thing a a "weather computer". They are also
> > considered memory hogs.
>
> They run on dedicated harware. Non on a system with public shell access,
> I hope. If you see 20 vi processes, 2 nethacks, 5 netscapes, a couple of
> gccs (with as and friends), 30 logged users, AND one weather simulation
> program (production and critical), there's something wrong.
>
> > > Use RT OSes when needed.
> > > Use resource limits and *limit the number of concurrent users* if you
> > > want strict control of them. If availability is you only concern, triple
> > > your budget.
> >
> > There are no enforcable resource limits. It is cheaper to use resource controls
> > and be able to justify the "triple" budget.
>
> We agree, so. If you have infinite money, you don't need overcommitting.
> And, BTW, you don't need swap space. Just buy 265TB of RAM.
> Thrashing can get your system to its knees anyway. Are you suggesting
> to remove the paging system to avoid thrashing? Doesn't the kernel
> "overcommit" when it allows process allocate more memory than the available
> RAM? Doesn't your low-latency critical app fail, when it serves data at
> disk speed, instead of RAM speed, because the kernel paged that out, without
> even notifing your app? Isn't this unpredictable behaviour? Doesn't
> malloc() grant you *memory*, not disk space? How far will you push your
> arguments?
>
> You seem to be fine with paging and the kernel granting more "memory"
> than it has available, and using disk when needed.
> I'm not. When I setup a proxy server, i don't like performances to drop
> because the system is short of RAM (but it has plenty of swap, so malloc
> succeeds, and squid is happy with it, and uses swap space as "RAM cache").

True - I'm not saying that dedicated function systems are not usefull.
Sometimes they are a necessity. This is aimed more at the multi-function and/or
multi-user system. These systems tend to be the larger ones available. Those
systems that have a 300-1000GB of disk storage, generate data sets between 2-26
GB, have 4 - 200 CPUs. These are usually multi-user systems. They are also
used for dedicated functions (I think the stock exchange uses dual high
availabiltiy servers in this range).

These are not small, single user systems.

> If this RAM shortage is caused by internal reasons (memory leakage in squid)
> I need to fix the app. If RAM shortage is caused by external causes, such
> as overload or DoS attempt, even restarting the server (or the whole
> system) won't help.

I may not be able to - I can report it to the vendor (ie oracle servers) but
I still have to run them as-is. This is where the resource limits help, they
are not a final solution (that takes fixing it).

> > > If you want to make best use of available resources, just use Linux (and
> > > its overcommitting mm system).
> >
> > Except when the use by multiple users collide without a way to resolve
> > the conflict other than aborting random user processes, or by crashing.
>
> On OOM/OOS situation, a non-overcommitting OS will deny execution of random
> user processes. You need user-level resource limits. It has nothing to
> do with overcommitting address space.

Yes. This is where the "best use" comes in. It must penalize someone, and quota
controles identify the individual user that will get the penalty.

> > > What's wrong with that?
> >
> > Nothing at all. I want to use linux where it is most appropriate. I also see
> > the lack of the option to have quotas as artificially limiting Linux to
> > just single user workstations.
>
> To effectively run a multi user system unattended you need much more the
> user-level resource limits. And i feel that "critical" application deserve
> dedicated hardware anyway.

Oh yes. This is another brick in the support. Linux itself has shown that the
foundation is VERY solid. This kernel works (nearly) forever when dedicated
to a single function and with a bit of overkill in hardware (mostly memory).

Part of my consideration here is for things like large beowulf clusters, the
time when Linux runs on a 256 node Origin 2000, or other equivalent level
systems (hey - when is it going to run on 1024 node SP system? When I asked
IBM salesmen they just talked about linux not being reliable enough, and how
I didn't want to run a non-production level OS on a $30-40M system... It almost
sounded like he just didn't want to say "never").

When the "critical application" is weather/ocean modeling - that system will
also be used to develop/extend the application. Very few people can afford
two such systems - one for the application, and another for development.
-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil

Any opinions expressed are solely my own.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Thu Mar 23 2000 - 21:00:38 EST