Re: Some questions about linux kernel.

From: Marco Colombo (marco@esi.it)
Date: Thu Mar 23 2000 - 13:40:59 EST


On Thu, 23 Mar 2000, Jesse Pollard wrote:

> > No. It's a win the more active processes you have. It increases system
> > throughput.
>
> It depends on the job mix - If there is one process that requires the
> entire CPU (and 70% resident memory, no swap). The throughput goes down with
> the additional 1 process (not much, just the context switch, cache reload...).

Of course. I was talking about the average case. There's always a worst case.

> If the additional process needs 40% of memory then both processes lose -
> context switch, cache reload, and paging.

Agreed again. You're just giving a worst-case example.

> Now the I/O overlap does provide some CPU for another process. In the past -
> (my experience) it is best to not have more than 2 or 3 active processes
> per CPU. This depends on the time delay the I/O subsystem takes. If it is
> a poor I/O system, then more active processes can be there. This starts
> taking a toll when the swap activity also increases (processes are already
> in an I/O wait...). At this point, additional processes will not improve
> throughput.

I definitely agree. That's exactly what I mean by "OOM". OOM = sum of
working sets of all processes exceeds available RAM. The page-fault rate
increases, and performance quickly degrades. This may happen with lots of
free swap space. The only thing you can do is add RAM.
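
To make the distinction concrete, here is a minimal Python sketch. It is not from the thread: the function name and its parameters are hypothetical, it assumes the working-set sizes are somehow known (in practice they are hard to measure), and it reads OOS as "total allocations exceed RAM plus swap", which is an assumption about the term.

```python
def classify_memory_pressure(working_sets_mb, allocated_mb, ram_mb, swap_mb):
    """Return the set of conditions that currently hold.

    OOM: the sum of all working sets exceeds available RAM (heavy paging).
    OOS: total allocated memory exceeds RAM plus swap (assumed meaning
         of "running out of space").
    """
    states = set()
    if sum(working_sets_mb) > ram_mb:
        states.add("OOM")
    if sum(allocated_mb) > ram_mb + swap_mb:
        states.add("OOS")
    return states


# Three processes with 200 MB working sets on a 512 MB box with 1 GB of swap:
# plenty of free swap, yet the working sets no longer fit in RAM.
print(classify_memory_pressure([200, 200, 200], [300, 300, 300], 512, 1024))
```

The example shows the common case argued for here: the box is in OOM (paging heavily) long before it is out of space.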

One of my arguments is precisely that: an OOM situation (heavy paging activity)
is much more common than OOS. You should take care of it, *before* you
reach OOS. A system which reaches OOS before OOM (so it runs out of space,
but with low paging activity) is possible, but uncommon. And you should
really just add more swap.

> > Save computed data to disk and reuse memory. Memory should be locked in
> > RAM. Beware of stack grows. And most of all run on dedicated hardware.
>
> That reduces throughput, and makes a tremendous amount of idle capacity. If every
> job had its own computer system, tuned to just that job, then you get
> the best throughput, but the worst economic use. That was why multi-user
> systems were created.

Agreed. That's the price for an optimally tuned, critical service.

It does not reduce throughput: there should not be "a tremendous amount" of
idle capacity. If you put an I/O-bound application on a 4-way SMP box,
it's your fault. If you put a 20KB-size process on an 8GB system, it's your
fault. If your application has a compute - write - read - compute
behaviour, so that the CPU is idle while writing and reading, but during
"compute" you need horsepower, well, it's badly engineered. It should be
multithreaded (to put it simply), with the compute phase of one thread
overlapping the write-read of the other one... we can go on forever...
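
As an illustration of that overlap, here is a rough Python sketch, not anything from the thread. The names `compute`, `writer` and `run_pipeline` are hypothetical, and the workload is a stand-in. One thread computes while a second thread writes the previous results to disk. In CPython the two threads do not compute in parallel, but blocking I/O releases the interpreter lock, so the write phase can overlap the compute phase.

```python
import queue
import threading


def compute(chunk_id):
    # Stand-in for the CPU-heavy phase (hypothetical workload).
    return sum(i * i for i in range(chunk_id * 1000, (chunk_id + 1) * 1000))


def writer(path, results):
    # Drains results to disk while the main thread keeps computing.
    with open(path, "w") as out:
        while True:
            item = results.get()
            if item is None:  # sentinel: no more results
                break
            out.write(f"{item}\n")


def run_pipeline(path, n_chunks):
    results = queue.Queue(maxsize=2)  # small buffer bounds memory use
    t = threading.Thread(target=writer, args=(path, results))
    t.start()
    for chunk_id in range(n_chunks):
        # put() blocks only if the writer falls two chunks behind.
        results.put(compute(chunk_id))
    results.put(None)
    t.join()
```

The bounded queue is the design point: the compute side never gets far ahead of the disk, so memory use stays predictable.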

The kernel provides general-purpose, poor man's mechanisms that help
you run a set of "dumb" applications in an effective way.

gcc takes advantage of the buffer cache. It's "dumb" in its FS access, because
the system buffer cache is good enough for its own purposes. Oracle probably
wants to bypass kernel buffer-caching (which is general-purpose)
because it can do better (knowing exactly what it needs).

[...]
> > We agree, so. If you have infinite money, you don't need overcommitting.
> > And, BTW, you don't need swap space. Just buy 265TB of RAM.
> > Thrashing can get your system to its knees anyway. Are you suggesting
> > to remove the paging system to avoid thrashing? Doesn't the kernel
> > "overcommit" when it allows a process to allocate more memory than the available
> > RAM? Doesn't your low-latency critical app fail, when it serves data at
> > disk speed, instead of RAM speed, because the kernel paged that out, without
> > even notifying your app? Isn't this unpredictable behaviour? Doesn't
> > malloc() grant you *memory*, not disk space? How far will you push your
> > arguments?
> >
> > You seem to be fine with paging and the kernel granting more "memory"
> > than it has available, and using disk when needed.
> > I'm not. When I set up a proxy server, I don't like performance to drop
> > because the system is short of RAM (but it has plenty of swap, so malloc
> > succeeds, and squid is happy with it, and uses swap space as "RAM cache").
>
> True - I'm not saying that dedicated function systems are not useful.
> Sometimes they are a necessity. This is aimed more at the multi-function and/or
> multi-user system. These systems tend to be the larger ones available. Those
> systems that have a 300-1000GB of disk storage, generate data sets between 2-26
> GB, have 4 - 200 CPUs. These are usually multi-user systems. They are also
> used for dedicated functions (I think the stock exchange uses dual high
> availability servers in this range).
>
> These are not small, single user systems.

I hope they are not multiuser systems (where people log in and use a shell).

If they are "used for dedicated functions", the application should be
controllable and tunable. You should never, ever see OOM or OOS. If you do
because of a runaway application, you have problems with your stock exchange
system bigger than overcommitting or not. I hope such an application does not
randomly malloc() pieces of address space without knowing how many
resources it's currently using.

Squid (just to reuse the same example) is a cache system. Basically it
uses disk space. Guess what, it never goes FS full, if properly configured.

I just decide how much space I need, create a FS on a dedicated partition,
ensure I have enough RAM to keep whatever index squid uses in RAM (based
on experience), and set a configurable parameter of squid (maximum disk
utilization on that partition) to a safe value (~90% of space).
The application itself takes care of not exceeding resource limits.
I put that on dedicated HW. Am I wasting CPU cycles? Yes, but knowing
that it's not a CPU-intensive app, I choose the cheapest CPU on the market.
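
A small Python sketch of that sizing step, under stated assumptions: the helper names are hypothetical, and treating squid's `cache_dir` line (whose size field is in megabytes) as the "configurable parameter" meant above is a guess, not something the post says. The `16 256` values are the usual first- and second-level directory counts.

```python
import shutil

MIB = 1024 * 1024


def safe_cache_size_mb(total_bytes, percent=90):
    """Cache size in MB: a fixed percentage of the partition's total size."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_bytes * percent // 100 // MIB


def cache_dir_line(mount_point, percent=90):
    # shutil.disk_usage reports the partition's total size in bytes.
    total = shutil.disk_usage(mount_point).total
    size_mb = safe_cache_size_mb(total, percent)
    return f"cache_dir ufs {mount_point} {size_mb} 16 256"


# A 10 GiB dedicated partition leaves 9216 MB for the cache at 90%.
print(safe_cache_size_mb(10 * 1024**3))
```

The remaining ~10% is headroom for filesystem overhead, so the cache itself never drives the partition to FS full.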

Is it critical? It is. Does it run unattended? Almost. Do I have random
users on the same system? Of course not.

On an overcrowded university multiuser system, I'd never install squid.
I'd never install any critical service. You don't have the money to
buy dedicated HW? Reconsider what "critical" means to you.

> > If this RAM shortage is caused by internal reasons (memory leakage in squid)
> > I need to fix the app. If RAM shortage is caused by external causes, such
> > as overload or DoS attempt, even restarting the server (or the whole
> > system) won't help.
>
> I may not be able to - I can report it to the vendor (ie oracle servers) but
> I still have to run them as-is. This is where the resource limits help, they
> are not a final solution (that takes fixing it).

Resource limits are useful. Full stop.

[...]

> > To effectively run a multi user system unattended you need much more than
> > user-level resource limits. And I feel that "critical" applications deserve
> > dedicated hardware anyway.
>
> Oh yes. This is another brick in the support. Linux itself has shown that the
> foundation is VERY solid. This kernel works (nearly) forever when dedicated
> to a single function and with a bit of overkill in hardware (mostly memory).
>
> Part of my consideration here is for things like large beowulf clusters, the
> time when Linux runs on a 256 node Origin 2000, or other equivalent level
> systems (hey - when is it going to run on 1024 node SP system? When I asked
> IBM salesmen they just talked about linux not being reliable enough, and how
> I didn't want to run a non-production level OS on a $30-40M system... It almost
> sounded like he just didn't want to say "never").

Commercial systems being "production level" is a myth.
One key factor is user base. A $30M system will never be as solid as an
Open Source piece of software used by 30M people.

> When the "critical application" is weather/ocean modeling - that system will
> also be used to develop/extend the application. Very few people can afford
> two such systems - one for the application, and another for development.

Again, reconsider what "critical" means to you. I'd never advise doing
development on production systems. If you're short of money, you live
with it.

> -------------------------------------------------------------------------
> Jesse I Pollard, II
> Email: pollard@navo.hpc.mil
>
> Any opinions expressed are solely my own.
>

.TM.

-- 
      ____/  ____/   /
     /      /       /			Marco Colombo
    ___/  ___  /   /		      Technical Manager
   /          /   /			 ESI s.r.l.
 _____/ _____/  _/		       Colombo@ESI.it
