Re: 2.1.76, nfs client, and memory fragmentation

Linus Torvalds (torvalds@transmeta.com)
Fri, 2 Jan 1998 12:09:45 -0800 (PST)


On Fri, 2 Jan 1998 kwrohrer@enteract.com wrote:

> And lo, Linus Torvalds saith unto me:
> >
> > Umm? And what do you do after a fork, when the same physical page is
> > present in multiple page tables?
>
> Until I figure out what to do with such a page, I can't move it. However,
> I can do what the existing VM routines already do: slate it for asynchronous
> swapping (which will decrement the usage count), and go on to the next
> likely-looking candidate. Of course, to do that, my reverse page table
> swells bigger than the forward page table, so at this point I'll just
> give up and try to grow a different free area...

The reverse page table can actually be made to be the same size as the
forward page table fairly easily. We can essentially make the pte entries
contain a next pointer (you don't want to do it exactly that way, because
you want the pte to match the hardware layout if possible, but
conceptually you can think of it as just making the pte larger).

And I generally don't worry all that much about the size of the page
tables. However, doubling them still makes me nervous.
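
To make the "next pointer" idea concrete, here is a minimal sketch in C.
It is a toy model and not kernel code: the names pte_frame, pte_next and
frame_head and all the helpers are hypothetical, and a pte is reduced to a
bare frame number. The hardware-format entries are left alone and the
chain pointers live in a parallel array with one slot per pte, which is
why the reverse table costs about as much as the forward one.

```c
#include <assert.h>

#define NR_PTES   8     /* toy size: slots in the forward page table */
#define NR_FRAMES 4     /* toy size: physical page frames */
#define NONE      (-1)

int pte_frame[NR_PTES];    /* "hardware" pte: frame number, NONE if not present */
int pte_next[NR_PTES];     /* shadow slot: next pte that maps the same frame */
int frame_head[NR_FRAMES]; /* per frame: first pte in its chain */

void rmap_init(void)
{
    for (int i = 0; i < NR_PTES; i++) {
        pte_frame[i] = NONE;
        pte_next[i] = NONE;
    }
    for (int f = 0; f < NR_FRAMES; f++)
        frame_head[f] = NONE;
}

/* Map a frame through a pte and push that pte onto the frame's chain. */
void map_page(int pte, int frame)
{
    assert(pte_frame[pte] == NONE);
    pte_frame[pte] = frame;
    pte_next[pte] = frame_head[frame];
    frame_head[frame] = pte;
}

/* Walk the chain to count how many ptes map this frame. */
int count_mappers(int frame)
{
    int n = 0;
    for (int pte = frame_head[frame]; pte != NONE; pte = pte_next[pte])
        n++;
    return n;
}

/* Clear every pte that maps the frame; returns how many were cleared. */
int unmap_all(int frame)
{
    int n = 0;
    int pte = frame_head[frame];
    while (pte != NONE) {
        int next = pte_next[pte];
        pte_frame[pte] = NONE;
        pte_next[pte] = NONE;
        pte = next;
        n++;
    }
    frame_head[frame] = NONE;
    return n;
}
```

With something like this, freeing one _specific_ frame is a walk of its
chain, at the price of one extra word per pte.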

Note that the normal VM routines can handle COW-shared pages fairly
efficiently the way they work now: forgetting them one at a time until
only the page cache one is left sounds like the chicken way out of a hard
problem, but it actually acts as a simple aging agent at the same time as
it allows us to not keep exact track of where all physical pages are
mapped, so for the normal page-out routines this is a perfectly acceptable
situation.
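
A toy model of that "forget one mapping at a time" behaviour, as a sketch
only: struct shared_page and both helpers are hypothetical names, and the
only state kept is a usage count (one reference per mapping process plus
one for the page cache). Each pass drops a single mapping without knowing
about the others, so a widely shared page survives more passes than a
barely shared one, which is the aging effect.

```c
#include <assert.h>

struct shared_page {
    int count;  /* one reference per mapping process, plus one for the page cache */
};

/*
 * One page-out pass over one process: forget only that process's mapping.
 * Returns 1 once the page cache reference is the only one left.
 */
int forget_one_mapping(struct shared_page *page)
{
    assert(page->count > 1);
    page->count--;
    return page->count == 1;
}

/* Number of passes needed before the page can really be thrown out. */
int passes_until_freeable(struct shared_page *page)
{
    int passes = 0;
    while (page->count > 1) {
        forget_one_mapping(page);
        passes++;
    }
    return passes;
}
```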

It's only when you want a _specific_ physical page that the lack of
reverse mapping is painful. That does happen with shared non-COW pages
occasionally (paging them out would be complex), but UNIX semantics tend
to make it fairly easy - the only shared non-COW pages that exist have a
well-specified backing store that all processes can agree about, so there
is no ambiguity about where on the disk a page should be. It does result
in potentially unnecessary page-outs (when multiple processes have the
same page dirty), but it's a pretty rare condition.

(But this is what makes anonymous shared mappings so hard, and why Linux
doesn't support them other than through the SysV shared memory shm
interface - which _does_ keep a reverse mapping internally).

So yes, reverse page tables would be useful, but it's fairly painful and
the gain is usually not all that large except for this one case. And this
one case could probably be solved other ways (I _still_ think that the
slab allocator is wrong in being so aggressive about getting large areas,
so it's not just a page-allocator problem, it's also a _usage_ problem).

> > But in order to keep track of shared pages (whether they are truly shared
> > or just COW-shared) you need to essentially have a shadow page table which
> > contains the linked list of virtual mappings..
>
> You seem to do fine without one. :-P I don't see the existing code making
> any effort to, e.g., kill all processes which were using a shared page when
> it encountered an error during swapping. :-) (I take it there's a reason
> to assume whatever failure ruined the original copy of the page? I'd
> expect transient failures to be more common...)

Right. That's my point. We _can_ generally do fine without any reverse
mapping at all, because most of the time it makes no difference. But if we
were to need a reverse mapping, then we need one that can handle page
sharing.

One approach is to have just a "hint-map", which might indeed be good
enough. The hint-map would contain entries only for private un-shared
pages - which is often a large fraction of the pages. HOWEVER, I suspect
that the UNIX fork()/exec() semantics would make even this impractical:
on average a lot of pages are not shared, but most pages are shared at
least for a short while every once in a while. And even if they become
unshared quickly after being shared, they will have lost the information
about what single mapping they had.

(So you could instead try to keep a hint map that has a pointer for each
physical page, but the pointer might point to just one of multiple users,
and the code that wants to free the page would have to check whether it
actually got the exclusive access. It might well be "good enough" for what
we're interested in - we couldn't free an arbitrary page, but we would
have a reasonable chance of freeing some pages we're interested in).
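
A rough sketch of such a per-physical-page hint, under stated assumptions:
struct frame_info, hint_map(), hint_unmap() and try_free_frame() are all
made-up names, a pte is just a frame number, and "freeing" only clears the
pte where real code would also write the page out and flush the TLB. The
hint is never repaired on unmap, so the freeing path has to verify both
that the page has exactly one user and that the hinted pte still maps it.

```c
#include <assert.h>

#define NR_PTES   8     /* toy size: page table slots */
#define NR_FRAMES 4     /* toy size: physical page frames */
#define NONE      (-1)

struct frame_info {
    int count;      /* how many ptes currently map this frame */
    int hint_pte;   /* ONE pte that mapped it at some point, possibly stale */
};

int ptes[NR_PTES];                   /* pte: frame number, NONE if not present */
struct frame_info frames[NR_FRAMES];

void hint_init(void)
{
    for (int i = 0; i < NR_PTES; i++)
        ptes[i] = NONE;
    for (int f = 0; f < NR_FRAMES; f++) {
        frames[f].count = 0;
        frames[f].hint_pte = NONE;
    }
}

/* Map a frame; the hint simply remembers the most recent mapper. */
void hint_map(int pte, int frame)
{
    ptes[pte] = frame;
    frames[frame].count++;
    frames[frame].hint_pte = pte;
}

/* Unmap a pte; the hint is deliberately left alone and may go stale. */
void hint_unmap(int pte)
{
    int frame = ptes[pte];
    if (frame == NONE)
        return;
    ptes[pte] = NONE;
    frames[frame].count--;
}

/* Try to free one specific frame. Returns 1 only if this call freed it. */
int try_free_frame(int frame)
{
    struct frame_info *f = &frames[frame];
    int pte = f->hint_pte;

    if (f->count != 1)
        return 0;               /* free already, or shared: no exclusive access */
    if (pte == NONE || ptes[pte] != frame)
        return 0;               /* stale hint: the single user is somewhere else */

    ptes[pte] = NONE;           /* real code would swap out and flush the TLB here */
    f->count = 0;
    f->hint_pte = NONE;
    return 1;
}
```

A frame mapped by two ptes cannot be freed this way, and if the hinted one
of the two is then unmapped the frame is private again but the hint is
stale, so the attempt still fails. That is the "lost the information"
case, and the cost is one int per physical frame.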

> Unfortunately, it seems like I need a heck of a lot more junk than
> just the page table entry to evict a page under Linux. Ignoring the
> shared page issue for now, is frobbing the pte (to change the physical
> address), plus (presumably) flushing the TLB by the time I'm done, safe?
> Or do I need to go galumphing through the plethora of structs the
> swap_out_I_really_mean_it_this_time() functions pass around?

It's not all that bad, but you do need the vma associated with the pte -
otherwise you wouldn't know where to swap it out to (it might have backing
store, and the vma might actually tell you that you mustn't swap it out at
all because the area is locked etc). But with the vma and the virtual
address you should have all the info you need.

So it should be entirely possible to add a (vma, offset) tuple into the
"struct page". In fact, if you limit the scheme to only non-shared
anonymous pages you wouldn't even need to grow the "struct page" at all:
you could overlay the information with the current (inode, offset)
information (or hash) that is needed for the page cache (but as this would
be an anonymous page you know that it cannot be in the page cache).
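
The overlay could look something like the sketch below. This is not the
real "struct page": page_sketch, vma_sketch, inode_sketch, the
PG_ANON_SKETCH flag and both helpers are hypothetical stand-ins, and the
sketch assumes a flag bit is available to say which interpretation of the
shared storage is valid.

```c
#include <assert.h>
#include <stddef.h>

struct inode_sketch;                        /* opaque stand-in for an inode */
struct vma_sketch { unsigned long start; }; /* stand-in for a vma */

struct cache_key { struct inode_sketch *inode; unsigned long offset; };
struct anon_key  { struct vma_sketch *vma;     unsigned long offset; };

#define PG_ANON_SKETCH 0x1u  /* hypothetical flag: the anon arm is the valid one */

struct page_sketch {
    unsigned int flags;
    union {
        struct cache_key cache; /* page cache identity: (inode, offset) */
        struct anon_key  anon;  /* private anonymous page: (vma, offset) */
    } u;                        /* both arms share the same storage */
};

/* Record the single mapping of a non-shared anonymous page. */
void page_set_anon(struct page_sketch *page, struct vma_sketch *vma,
                   unsigned long offset)
{
    page->flags |= PG_ANON_SKETCH;
    page->u.anon.vma = vma;
    page->u.anon.offset = offset;
}

/* Look up the owner; returns 0 if the page is not a private anonymous one. */
int page_anon_owner(const struct page_sketch *page, struct vma_sketch **vma,
                    unsigned long *offset)
{
    if (!(page->flags & PG_ANON_SKETCH))
        return 0;
    *vma = page->u.anon.vma;
    *offset = page->u.anon.offset;
    return 1;
}
```

Because the two arms have the same shape (a pointer plus an unsigned
long), the union is no bigger than the page cache fields alone, so the
memory needed stays proportional to the number of physical pages.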

I'd certainly like a scheme that would need memory only proportional to
the physical memory space - even if the scheme would be less than perfect
in some regards. But some things that want the reverse mapping do want it
expressly for the case of shared pages, so the question then becomes
whether you want to make a more limited scheme that works well enough for
the page allocator but not some other cases, or whether you really want to
go the whole way..

Linus