2.6.0-test11-wli-2

From: William Lee Irwin III
Date: Thu Dec 11 2003 - 00:30:53 EST


Successfully tested on a Thinkpad T21 and 32GB NUMA-Q. Any feedback
regarding performance would be very helpful. Desktop users should
notice top(1) is faster, kernel hackers that kernel compiles are faster,
and highmem users should see much less per-process lowmem overhead
(but nothing truly radical like pgcl or 4/4).

This release features 4KB stack fixes for task_pt_regs(), and is vastly
cleaned up compared to the test8 release. It also features the wchan
and major/minor fault accounting fixes. I still need to fold the
anobjrmap fixes into the original patches and find some alternative way
for smbfs and ncpfs to do whatever d_validate() was doing for them (they
won't compile until I do; for now, they depend on CONFIG_BROKEN). I
also need to find a coherent way to incorporate the cleanups for pte
caching suggested by akpm without bloating the pte cache on small boxen.

It's also worth noting that the patches "originally due" to someone
were just plucked off of the list by me (in some cases, they just first
implemented of the idea), so be sure to ascribe all the credit to
them, and all the bugs to me. I wasn't sure of a complete (or even
partial) list of those to credit for top-down vma allocation, hence
"various", though it was jejb who described the glibc workaround.

Available from:
ftp://ftp.kernel.org/pub/linux/kernel/people/wli/kernels/2.6.0-test11/patch-2.6.0-test11-wli-2.gz

336 files changed, 4668 insertions(+), 3054 deletions(-)


-- wli

1: highpmd
-- originally due to aa as part of pte-highmem
-- from scratch implementation based on 2.6 highpte
Shove middle level pagetables into highmem PAE process
scalability, for a lowmem savings of 20KB/process. This
is a large fraction of the per-process lowmem overhead.

2: O(1) buffered_rmqueue()
Make zone->lock hold time independent of batch counts
with a revised page allocator algorithm.

3: i386 pte caching
Highpte-compatible implementation of leaf pagetable
node caching. Should eliminate pte_alloc_one() from
profiles as well as conserve cache.

4: O(lg(n)) proc_pid_readdir()/proc_task_readdir()
Make /proc/ task enumeration scalable by organizing
lists as rbtrees for O(lg(n)) list seeks. To see how
effective this is, try doing the following:
(a) start up top(1)
(b) for n in `seq 1 1024`; do sleep 60; done
... and watch -wli whale on mainline with vastly
reduced cpu consumption.

5: O(1) proc_pid_statm()
-- originally due to bcrl
proc_pid_statm() involves a vma list walk for every
process on the system. This patch keeps count of the
statistics instead for extreme process scalability:
as they're integers, no per-process locks are required.

6: kmap_atomic() microoptimizations
Trivial microoptimization that have been in RHAS for years.

7: compile-time mapping->page_lock type selection
Use rwlocks for mapping->page_lock on larger systems for
greater concurrency. Nice improvement for SDET.

8: 4KB i386 stacks
-- originally due to bcrl
Even more PAE process scalability: cut per-process lowmem
overhead from kernel stacks in half.

9: objrmap
-- originally due to dmc/mbligh
Teach the VM about how to use accounting done for file-
backed mmap() to avoid creating some pte_chains.

10: RCU mapping->i_shared_lock
Yes, the first VM RCU patch.
mapping->i_shared_sem was a contention point with some
catastrophic contention overheads. This goes all the way
and RCU's it.

11: anobjrmap, phase 1
-- originally due to hugh
Prepare the VM to change how some of the fields of struct
page are interpreted.

12: anobjrmap, phase 2
-- originally due to hugh
Remove pte_chains and temporarily disable swapout.

13: anobjrmap, phase 3
-- originally due to hugh
Set up new accounting structures for anonymous pages to
perform PTE shootdowns in an analogous way to file-backed.

14: anobjrmap, phase 4
-- originally due to hugh
Teach the VM how to handle remap_file_pages() and mremap()
and reinstate paging of anonymous memory.

15: anobjrmap, phase 5
-- originally due to hugh
Micro-optimize the new rmap functions.

11-15 are also known as "full object-based rmap" or "anonobjrmap"
11-15 mean pte_chains are *NEVER* created, *EVER*, except for two
rare instances:
(1) remap_file_pages()
(2) a rare (nonfatal) race during an in-flight mremap()
that causes pages to appear at two virtual
addresses simultaneously.

16: RCU anon->lock
This RCU's the new anobjrmap accounting structures
analogous to mapping->i_mmap and mapping->i_shared_lock.
VM RCU patch #2.

17: convert copy_strings() to use kmap_atomic() instead of kmap()
Kill off one more instance of the sleeping kmap() and
so rip out one more acquisition of the global kmap_lock
and potential global TLB flush.

18: increase page allocator batch counts
O(1) buffered_rmqueue() makes the algorithm scale with
large batches. This jacks up the batch sizes to exploit
that newfound scalability.

19: node-local i386 per_cpu areas
Bootmem allocation of per_cpu areas implies lowmem
allocation, which furthermore implies they're all
trapped on node 0. This patch utilizes the in-mainline
arch code hook to perform the necessary remapping so
that the per_cpu area belonging to a cpu is node-local.

20: CONFIG_MMAP_TOPDOWN, top-down vma allocation
-- various prior implementations
This compacts i386 virtualspace for reduced pagetable
space usage as well as enabling larger heaps and mmap()'s.
2.9GB mmap()'s and heaps are feasible in a 3:1 split.

21: anobjrmap fixes
Fix some bogons in the anobjrmap code.

22: increase static vfs hashtable and VM array sizes
Certain hashtables in the VM and vfs are statically
sized and worse yet microscopic. This jacks up their
sizes substantially for increased performance with
little code or space impact.

23: intermezzo 4KB stack fixes
-- due to muli
This fixes obscene stack usage by the intermezzo fs.

24: /proc/ BKL gunk plus page wait hashtable sizing adjustment
This nukes some unnecessary BKL acquisitions in /proc/
code and increases page wait hashtable size for better
waitqueue hashing scalability.

25: invalidate_inodes() speedup
-- originally due to Kirill Korotaev
This makes umount(8) less of a catastrophe, particularly
in automount setups typical with academic NFS setups.

26: fix the anobjrmap fix and adjust page_alloc.c inlining
Minor tweaks for the page allocator, and another
anobjrmap bugfix.

27: remove ->valid_addr_bitmap, kern_addr_valid(), and d_validate()
The pgdat->valid_addr_bitmap is unused everywhere. This
removes it and so likely saves a small amount of lowmem,
and nukes d_validate() as collateral damage. If you
realize what d_validate() does, you'll see why. (It tries
to "validate" some random address grabbed off the wire
or some other random place as a dcache entry.)

28: fix wchan accounting
Scheduling internals are still reported as blocking
points by get_wchan(); this uses an ELF section to mark
scheduling internals in a better way than current.

29: fix major/minor fault accounting
The major and minor fault statistics reported by
getrusage(2) are completely bogus; this patch corrects
the kernel's accounting so they are meaningful.

30: remove mm->swap_address
Remove an unused field of the process address space
structure.

31: 4KB stack fixes
Fix an oops and oops reporting.

32: patch-2.6.0-test11-bk8
Linus' current bk snapshot, featuring various
critical bugfixes for tmpfs, mmap(), and others.

33: EXTRAVERSION
Let people know this kernel is -wli.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/