Re: Linux and physical memory

David S. Miller (davem@dm.cobaltmicro.com)
Mon, 18 Jan 1999 20:51:23 -0800


From: Jakub Jelinek <jj@sunsite.ms.mff.cuni.cz>
Date: Tue, 19 Jan 1999 00:43:33 +0100 (CET)

[ CC:'d Ingo and Stephen, they might actually implement this, hint
hint :-) ]

You cannot simply make a difference between user and other pages. Think
about e.g. COW. Which page would it access through user address space?
The source page or the destination? One would have to be mapped in kernel
space (Linux COW uses both source and destination here within the kernel
mapping of all virtual memory).

Jakub, you forget one important thing, and something we have been
doing on sparc64 for some time now (not to support a lot of memory,
but for performance, to eliminate some TLB thrashing).

Anonymous pages aren't an issue at all if you can directly and
efficiently load/unload TLB mappings in software like we can on
sparc64. In fact, for sparc64, it is more efficient to (and we do):

1) pick two (unlocked) TLB entries, save their contents
2) load the two (one for clear_page) pages into the TLB at some unused
2 virtual kernel pages dedicated for this purpose
3) perform the bzero/bcopy of the page(s)
4) restore TLB entries saved away in #1

I only see some strange x86 mechanisms to mess with the TLB directly,
and I bet they are both slow and not well tested (and furthermore,
probably not well emulated on x86 clone chips). So this isn't a real
option because it'll be a performance hit.

But if someone wanted >2GB physical memory and wanted to use the pages
for real anonymous user pages on Intel, at least it is possible but
potentially very slow.

[ Keep in mind, the following is a big hack, and not something I'd
think would get into a standard kernel ever. ]

So roughly for all non-device driver stuff the implementation on Intel
could be:

1) Add a per-page flag PG_highmem
2) At boot time set per-page flags in this way based upon what
where in physical memory the page is and what PAGE_OFFSET the
kernel is using.
3) Add some mechanism to tell __get_free_pages() that PG_highmem
pages are OK to use.
4) Fix do_no_page and do_wp_page code paths to allocate anonymous
pages allowing the special PG_highmem type.
5) Implement __copy_high_page() and __clear_high_page() for x86,
by picking 2 unused virtual addresses in kernel space and making
temporary mappings on the local processor during the copy/clear
page operation. This may get tricky because of how swapper_pg_dir
is shared amongst processors, but actually since the local task
has it's own page table it should work _iff_ all copy/clear page
operations happen in such a context during page faults (I think
they do).
6) Change copy_page()/clear_page() macros on x86 to use the new
routines if one of the pages are PG_highmem.

This should "just work" besides the x86 page table details necessary
to access the higher physical pages.

Now if you get this right, you can then get adventurous and try to
make the page cache use PG_highmem too. The problems here are:

1) Some bits of code want to coalesce/uncoalesce buffer head buffers
to/from pages by copying them one by one into/from the page. I
haven't checked all the places which do this, so I have no
suggestion about how to handle this.
2) A new __copy_to_user_highmem() will be needed during reads out
of the page cache to userspace.
3) Set host->has_unchecked_isa_dma in all PCI scsi drivers used, and
set ISA_DMA_THRESHOLD in asm-i386/scatterlist.h appropriately :-)
Actually, there may be another way to handle this, especially on
64-bit PCI busses with devices which speak it, but this is a quick
way to make it work.

I wouldn't want to code and support this, but I suppose if someone out
there had such a big bad ass Intel system, they can afford to pay one
of us to do it for them if they can afford such a machine in the first
place :-)

Just a brain dump...

Later,
David S. Miller
davem@dm.cobaltmicro.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/