Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

From: Andrea Arcangeli
Date: Fri Mar 05 2004 - 15:30:07 EST


On Fri, Mar 05, 2004 at 11:55:05AM -0800, Martin J. Bligh wrote:
> Didn't used to in SLES8 at least. maybe it does in 2.6 now, I know Andrew
> worked on that a lot.

it should in every SLES8 kernel out there too (it wasn't in mainline
until very recently), see the related bhs stuff.

> In 2.6, I think the task struct is outside the kernel stack either way.
> Maybe you were pointing out something else? not sure.

I meant that making the kernel stack 4k pretty much requires removing
the task_struct, making it 4k w/o removing the task_struct sounds too
small.

> > The main thing you didn't mention is the overhead in the per-cpu data
> > structures, that alone generates an overhead of several dozen mbytes
> > only in the page allocator, without accounting the slab caches,
> > pagetable caches etc.. putting an high limit to the per-cpu caches
> > should make a 32-way 32G work fine with 3:1 too though. 8-way is
> > fine with 32G currently.
>
> Humpf. Do you have a hard figure on how much it actually is per cpu?

not a definitive one, but it's sure more than 2m per cpu, could be 3m
per cpu.

> > other relevant things are the fs stuff like file handles per task and
> > other pinned slab things.
>
> Yeah, that was a huge one we forgot ... sysfs. Particularly with large
> numbers of disks, IIRC, though other resources might generate similar
> issues.

which doesn't need to be mounted during production and hotplug should
mount read it and unmount. It's worthless to leave it mounted. Only
root-only hardware related stuff should be in sysfs, everything else
that has been abstracted at the kernel level (transparent to
applications) should remain in /proc. unmounting /proc hurts the
production systems, unmounting sysfs should not.

> You mean with the 8cpu box you mentioned above? Yes, probably 5K. Larger
> boxes will get progressively scarier ;-)

yes.

> What scares me more is that we can sit playing counting games all day,
> but there's always something we will forget. So I'm not keen on playing
> brinkmanship games with customers systems ;-)

this is true for 4:4 too. Also with 2.4 the system will return -ENOMEM,
not like 2.6 that lockup the box. so it's not a fatal thing if a certain
kernel can't sustain a certain workload in a certain hardware, just like
it's not a fatal thing if your run out of memory for the pagetables on a
64bit architecture with 64bit kernel. My only object is to make it feasible
to run the most high end workloads in the most high end hardware with a
good safety margin, knowing if something goes wrong the worst that can
happen is that a syscall returns -ENOMEM. There will always be a
malicious workload able to fill the zone-normal, if you fork off a tons
of tasks, and you open a gazzillon of sockets and you flood all of them
at the same time to fill all receive windows you'll fill your cool 4G
zone-normal of 4:4 in half a second with a 10gigabit NIC.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/