Re: Linux and physical memory

Stephen C. Tweedie (sct@redhat.com)
Thu, 21 Jan 1999 14:25:51 GMT


Hi,

On Mon, 18 Jan 1999 20:51:23 -0800, "David S. Miller" <davem@dm.cobaltmicro.com> said:

> [ CC:'d Ingo and Stephen, they might actually implement this, hint
> hint :-) ]

We plan to. :)

> From: Jakub Jelinek <jj@sunsite.ms.mff.cuni.cz>
> Date: Tue, 19 Jan 1999 00:43:33 +0100 (CET)

> You cannot simply make a difference between user and other pages.
> Think about e.g. COW. Which page would it access through user
> address space? The source page or the destination? One would have
> to be mapped in kernel space (Linux COW uses both source and
> destination here within the kernel mapping of all virtual memory).

Yes, but the page does not have to be in the kernel's mapping _all_ of
the time.

The way I'm currently leaning towards is to reserve something like 100MB
of the kernel's virtual address space for dynamic VA mappings on Intel,
hooked on a simple LRU list. For every physical page, add a page->kva
pointer containing that page's current kernel virtual address, and use a
page_kva() macro to find the virtual address for any given struct page.
It is relatively straightforward to recycle the mappings in that 100MB
region to make sure that all pages the kernel is currently using have a
valid kva, and that we don't have to update the va mappings for every
single kernel access.

This works pretty well because the number of times we have to access
user memory via such a mechanism is pretty small. ptrace() is the
obvious one. COW is easy: when we take a COW fault, the existing data
is already in the faulting process's VA, so we just copy it out of the
existing mm context and into the newly allocate page (we may need to
kva-map the new page, of course). Similarly, read/write just need to
kva-map the page cache page and copy to/from the current process's user
address space.

> So roughly for all non-device driver stuff the implementation on Intel
> could be:

> 1) Add a per-page flag PG_highmem
> 2) At boot time set per-page flags in this way based upon what
> where in physical memory the page is and what PAGE_OFFSET the
> kernel is using.

I'd prefer just to fix the PAGE_OFFSET in this model, and specify 1GB
kernel VA at all times. 800MB can be physically mapped, non-highmem
pages; 100MB can give us a decent size of kva mapping; and the remainder
is free for vmalloc. Of course, we'll have to use 4k page tables for
the 100MB kva VA, which kind of sucks, but it's better than remapping
for every single highmem access.

> 3) Add some mechanism to tell __get_free_pages() that PG_highmem
> pages are OK to use.
> 4) Fix do_no_page and do_wp_page code paths to allocate anonymous
> pages allowing the special PG_highmem type.

Add the page cache to that. Page cache is no less or more difficult
than anonymous pages, since in both cases we need to be able to support
(or give the illusion of) direct IO into highmem (think mmap and/or
swapping). For pages above 4G, that illusion will have to be achieved
transparently via bounce buffers in ll_rw_block.

> 5) Implement __copy_high_page() and __clear_high_page() for x86,
> by picking 2 unused virtual addresses in kernel space and making
> temporary mappings on the local processor during the copy/clear
> page operation. This may get tricky because of how swapper_pg_dir
> is shared amongst processors, but actually since the local task
> has it's own page table it should work _iff_ all copy/clear page
> operations happen in such a context during page faults (I think
> they do).

Much better to use the page_kva() style, and make the temporary mapping
persistent in the short to medium term with LRU recycling of kvas. And
yes, notification between CPUs is the tricky part. The best mechanism
may well be architecture-specific. One thought was to have a kva MM
sequence number on each CPU, so that any time we do a global
invalidate() we can increase the local sequence number and assume all
our local tlbs are uptodate; but until that time, we can still detect
kvas which may not be locally uptodate. That will at least allow us to
defer the SMP-global page invalidation until we detect a cross-CPU
conflict, so that the common case is fast: only if several CPUs need to
access the same kva without an intervening context switch do we
invalidate everywhere.

> 1) Some bits of code want to coalesce/uncoalesce buffer head buffers
> to/from pages by copying them one by one into/from the page. I
> haven't checked all the places which do this, so I have no
> suggestion about how to handle this.

The buffer cache itself probably needs to stay in lowmem, because of the
amount of filesystem code which uses it for metadata. Page cache IO
buffers are special temporary buffers, and it is easy enough to lock
them into kva while that IO is in progress: the buffer_heads will never
persist after the IO.

The SMP invalidation is probably the most interesting part of the whole
thing, in that there are lots of ways of doing it and it is not
particularly obvious up front what The Best way is. The rest of the
implementation should not be that complex.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/