On the one hand, my reverse page table building code is nearly complete,
though without shared page support (i.e. if it's shared, the reverse
mapping should be invalidated). I encoded the reverse entry as a struct
vm_area_struct *vma and an unsigned long vma_offset (defined to be
virtual address - vma->vm_start), both tacked onto struct page.
I added a consistency check to the existing page-by-page swapping
code in vmscan.c, and discovered that not only have I not found at
least one mechanism of page sharing (which doesn't necessarily
increment page->count!), but something else is making my offsets be
occasionally off by a factor of two (e.g. virtual address bffff000,
vma->vm_start=bfffd00, but vma_offset is only 1000, not 2000).
I don't know if I didn't go down one last order in my allocation
code, or I'm doing short pointer arithmetic somewhere, or some
other code is moving vma extents...
Having spent as much time as I have looking at the mm code, I can
at least say that there's only a handful of people worldwide who
understand it better than me. Alas, I think there's only a handful
of people worldwide who understand it in the first place...
> > > The binary tree method sounds promising. With it we could
> > > do true 'gravitational allocing' so the really large free
> > > areas will always gravitate to the bottom of physical memory.
> > Well, you got license to a solid, lean red-black tree implementation?
> > If so, drop it in! Same for a solid, lean AVL tree implementation.
> > If starting from scratch I'd prefer the red-black tree because
> > I know there are good algorithm descriptions out there; I haven't
> > seen good pseudo-code for AVL trees yet.
> I remember we used AVL trees for Linux VMA management, but
> since most (read: all but none) programs have only 2 or 3
> VMAs (with some heavy users at 6) it was heavier than needed.
> IIRC, this code has been in from '94 till '96, so it should
> be stable by now :-)
Yay. If this code won't do, I know Cormen, Leiserson and Rivest
has pseudocode (and Sedgewick's code is to be avoided); alas, there
ends my experience with algorithm books.
> > And since we'd have things taking off the rightmost and the leftmost,
> > but nobody wanting the "middle", we will definitely have to deal
> > with tree imbalance.
> It might be worth it to (temporarily) use the old AVL tree
> code with some interface (read: standard macro definitions)
> for it, so we can always improve it to something better, and
> it can be easily shared between different parts of the kernel.
Go for it. My hands are full right now...
> > > > Hmm. FWIW, without Zlatko's code but with Linus's and my fragment
> > > > reservation/fragment discrimination patches I've had no problems
> > > > with allocating, e.g. sound buffers after heavy memory consumption,
> > > Sound buffers don't get allocated that often that I would
> > > call it heavy...
> > No, I mean I did heavy memory consumption (using xv with the visual
> > schnauzer to browse directories with 500-1000+ images, then fired
> > up e.g. xanim on an .avi with sound, and the sound driver loaded
> > and allocated its buffer no problem.
> AHH. That's kinda heavy, yes. Then load Netscape with your own
> HTTP-proxy :-)
Netscape alone does quite a bit. Printing from netscape (through
ghostscript, LPRng which pipes, off to a remote spooler) when xv had
eaten all the memory queue_glue wanted is still a recipe for a hard
lockup for me...
> > Still, all the 16k free areas in the world don't guarantee the
> > existance of a 64k free area.
> 1. the 'new/old' allocation code (>.77) doesn't take pages from
> 64k free area's unless there are enough of them or the priority
> is high enough
> 2. you don't need a 64k sound buffer. even with heavy memory use,
> 16k usually does a fine job.
*shrug* It used to ask for a number and recommend 64k to me; I don't
know what it actually does now that it stopped asking for a number
at make config time, and don't know for certain that the buffer ever
needed to be contiguous...
> > With the reverse page table in
> > place, it will be child's play to create them by blitting (GFP_ATOMIC
> > low-order allocations) or intelligently chosen swapping (failed
> > allocations of all sorts). This sort of thing might help with
> > the sluggish paging I've seen, since swapping will be driven
> > by demand as well as a general desire to keep a few free pages
> > around "just in case of the inevitable".
> 3. you're absolutely right, but we can't put completely new
> code in 2.2, just some of the 'tried and tested' hacks
> we've been talking about lately.
Fine. I'll make my patches vs 2.1.latest-I-can-use until 2.3 rolls
around, and Linus doesn't have to roll them in before 2.2 unless he
wants to.
> 4. I really don't want to see any of those hacks in 2.3 !!!
> Once we start that series, we should add the reverse mapping
> and all the other goodies we need for 'real VM' :-)
A reverse mapping scheme which is useful 100% of the time is my
eventual goal as well. First to get something that always works
right but is only sometimes helpful, then to try to get some use
out of it...
Keith