Re: [patch][rfc] 5/5: core remove PageReserved

From: Nick Piggin
Date: Thu Jun 23 2005 - 05:44:26 EST


William Lee Irwin III wrote:
On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

@@ -337,7 +338,7 @@ static inline void get_page(struct page static inline void put_page(struct page *page)
{
- if (!PageReserved(page) && put_page_testzero(page))
+ if (put_page_testzero(page))
__page_cache_release(page);
}


No sweep before this? I'm not so sure.


Sweep what? There is a warning that will get triggered in the case
we free a PageReserved page through this path.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

@@ -514,7 +534,8 @@ int copy_page_range(struct mm_struct *ds
return 0;
}

-static void zap_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+static void zap_pte_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
struct zap_details *details)
{


As exciting as this is, !!(vma->vm_flags & VM_RESERVED) could trivially
go into struct zap_details without excess args or diff.


Actually, it isn't trivial. I thought of trying that.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

@@ -1225,6 +1251,8 @@ static int do_wp_page(struct mm_struct *
unsigned long pfn = pte_pfn(pte);
pte_t entry;

+ BUG_ON(vma->vm_flags & VM_RESERVED);
+
if (unlikely(!pfn_valid(pfn))) {
/*
* This should really halt the system so it can be debugged or


!pfn_valid(pfn) is banned when !(vma->vm_flags & VM_RESERVED); here we
BUG_ON the precondition then handle !pfn_valid(pfn) in the old manner
where some new infrastructure has been erected for reporting such errors.


No, it uses print_invalid_pfn too.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

@@ -1259,13 +1286,16 @@ static int do_wp_page(struct mm_struct *
/*
* Ok, we need to copy. Oh, well..
*/
- if (!PageReserved(old_page))
+ if (old_page == ZERO_PAGE(address))
+ old_page = NULL;
+ else
page_cache_get(old_page);
+
spin_unlock(&mm->page_table_lock);

if (unlikely(anon_vma_prepare(vma)))
goto no_new_page;
- if (old_page == ZERO_PAGE(address)) {
+ if (old_page == NULL) {
new_page = alloc_zeroed_user_highpage(vma, address);
if (!new_page)
goto no_new_page;


There are some micro-optimizations mixed in with this and some
subsequent do_wp_page() alterations.


We no longer page_cache_get ZERO_PAGE, ever. That I changed it so
in a more optimal way than it was is surely acceptable?


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

@@ -1654,7 +1656,7 @@ void __init memmap_init_zone(unsigned lo

for (page = start; page < (start + size); page++) {
set_page_zone(page, NODEZONE(nid, zone));
- set_page_count(page, 0);
+ set_page_count(page, 1);
reset_page_mapcount(page);
SetPageReserved(page);
INIT_LIST_HEAD(&page->lru);


The refcounting and PG_reserved activity in memmap_init_zone() is
superfluous. bootmem.c does all the necessary accounting internally.


Well not superfluous yet. It is kept around to support all the arch
code that still uses it (not much, mainly reserved memory reporting).


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c
+++ linux-2.6/mm/fremap.c
@@ -29,18 +29,21 @@ static inline void zap_pte(struct mm_str
return;
if (pte_present(pte)) {
unsigned long pfn = pte_pfn(pte);
+ struct page *page;

flush_cache_page(vma, addr, pfn);
pte = ptep_clear_flush(vma, addr, ptep);
- if (pfn_valid(pfn)) {
- struct page *page = pfn_to_page(pfn);
- if (!PageReserved(page)) {
- if (pte_dirty(pte))
- set_page_dirty(page);
- page_remove_rmap(page);
- page_cache_release(page);
- dec_mm_counter(mm, rss);
- }
+ if (unlikely(!pfn_valid(pfn))) {
+ print_invalid_pfn("zap_pte", pte, vma->vm_flags, addr);
+ return;
+ }
+ page = pfn_to_page(pfn);
+ if (page != ZERO_PAGE(addr)) {
+ if (pte_dirty(pte))
+ set_page_dirty(page);
+ page_remove_rmap(page);
+ dec_mm_counter(mm, rss);
+ page_cache_release(page);
}
} else {
if (!pte_file(pte))


There is no error returned here to be handled by the caller.


That's OK, the pte has been cleared. Nothing else we can do.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

@@ -65,6 +68,8 @@ int install_page(struct mm_struct *mm, s
pgd_t *pgd;
pte_t pte_val;

+ BUG_ON(vma->vm_flags & VM_RESERVED);
+
pgd = pgd_offset(mm, addr);
spin_lock(&mm->page_table_lock);

@@ -122,6 +127,8 @@ int install_file_pte(struct mm_struct *m
pgd_t *pgd;
pte_t pte_val;

+ BUG_ON(vma->vm_flags & VM_RESERVED);
+
pgd = pgd_offset(mm, addr);
spin_lock(&mm->page_table_lock);


This has no effect but to artificially constrain the interface. There
is no a priori reason to avoid the use of install_page() to deposit
mappings to normal pages in VM_RESERVED vmas. It's only the reverse
being "banned" here. Similar comments also apply to zap_pte().


No, install_page is playing with the page (eg. page_add_file_rmap)
which is explicity banned even before my PageReserved removal. It
is unclear that this ever safely worked for normal pages, and will
hit NULL page dereferences if trying to do it with iomem.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

Index: linux-2.6/drivers/scsi/sg.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sg.c
+++ linux-2.6/drivers/scsi/sg.c
@@ -1887,9 +1887,10 @@ st_unmap_user_pages(struct scatterlist *
int i;

for (i=0; i < nr_pages; i++) {
- if (dirtied && !PageReserved(sgl[i].page))
+ if (dirtied)
SetPageDirty(sgl[i].page);
/* unlock_page(sgl[i].page); */
+ /* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */
/* FIXME: cache flush missing for rw==READ
* FIXME: call the correct reference counting function
*/


An answer should be devised for this. My numerous SCSI CD-ROM devices
(I have 5 across several different machines of several different arches)
are rather unlikely to be happy with /* FIXME: XXX ... as an answer.


The worst it will do is dirty a VM_RESERVED page. So it is going
to work unless you're thinking about doing something crazy like
mmap /dev/mem and send your CDROM some pages from there. But yeah,
I have to find someone who knows what they're doing to look at this.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

Index: linux-2.6/drivers/scsi/st.c
===================================================================
--- linux-2.6.orig/drivers/scsi/st.c
+++ linux-2.6/drivers/scsi/st.c
@@ -4435,8 +4435,9 @@ static int sgl_unmap_user_pages(struct s
int i;

for (i=0; i < nr_pages; i++) {
- if (dirtied && !PageReserved(sgl[i].page))
+ if (dirtied)
SetPageDirty(sgl[i].page);
+ /* FIXME: XXX don't dirty/unmap VM_RESERVED regions? */
/* FIXME: cache flush missing for rw==READ
* FIXME: call the correct reference counting function
*/


Mutatis mutandis for my SCSI tape drive.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

Index: linux-2.6/sound/core/pcm_native.c
===================================================================
--- linux-2.6.orig/sound/core/pcm_native.c
+++ linux-2.6/sound/core/pcm_native.c
@@ -2942,8 +2942,7 @@ static struct page * snd_pcm_mmap_status
return NOPAGE_OOM;
runtime = substream->runtime;
page = virt_to_page(runtime->status);
- if (!PageReserved(page))
- get_page(page);
+ get_page(page);
if (type)
*type = VM_FAULT_MINOR;
return page;


snd_malloc_pages() marks all pages it allocates PG_reserved. This
merits some commentary, and likely the removal of the superfluous
PG_reserved usage.


Sure, but not in this patch. The aim here is just to eliminate special
casing of refcounting. Other PG_reserved usage can stay around for the
moment (and is actually good for catching errors).


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1073,6 +1073,17 @@ munmap_back:
error = file->f_op->mmap(file, vma);
if (error)
goto unmap_and_free_vma;
+ if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
+ == (VM_WRITE | VM_RESERVED)) {
+ printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
+ "PROT_WRITE mmap of VM_RESERVED memory, which "
+ "is deprecated. Please report this to "
+ "linux-kernel@xxxxxxxxxxxxxxx\n",current->comm);
+ if (vma->vm_ops && vma->vm_ops->close)
+ vma->vm_ops->close(vma);
+ error = -EACCES;
+ goto unmap_and_free_vma;
+ }
} else if (vm_flags & VM_SHARED) {
error = shmem_zero_setup(vma);
if (error)


This is user-triggerable where the driver honors mmap() protection
flags blindly.


If the user is allowed write access to VM_RESERVED memory, then I
suspect there is a lot worse they can do than flood the log.

But the check isn't going to stay around forever.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

Index: linux-2.6/mm/bootmem.c
===================================================================
--- linux-2.6.orig/mm/bootmem.c
+++ linux-2.6/mm/bootmem.c
@@ -286,6 +286,7 @@ static unsigned long __init free_all_boo
if (j + 16 < BITS_PER_LONG)
prefetchw(page + j + 16);
__ClearPageReserved(page + j);
+ set_page_count(page + j, 0);
}
__free_pages(page, order);
i += BITS_PER_LONG;


ibid re: bootmem


Not yet.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

Index: linux-2.6/kernel/power/swsusp.c
===================================================================
--- linux-2.6.orig/kernel/power/swsusp.c
+++ linux-2.6/kernel/power/swsusp.c
@@ -433,15 +433,12 @@ static int save_highmem_zone(struct zone
continue;
page = pfn_to_page(pfn);
/*
+ * PageReserved(page) -
* This condition results from rvmalloc() sans vmalloc_32()
* and architectural memory reservations. This should be
* corrected eventually when the cases giving rise to this
* are better understood.
*/
- if (PageReserved(page)) {
- printk("highmem reserved page?!\n");
- continue;
- }
BUG_ON(PageNosave(page));
if (PageNosaveFree(page))
continue;


This behavioral change needs to be commented on. There are some additional
difficulties when memory holes are unintentionally covered by mem_map[];
It is beneficial otherwise. It's likely to triplefault on such holes.


It seems the author of this code themselves didn't really understand
what was going on here, so I'm buggered if I can be bothered :)

Remember though, PageReserved can stay around for as long as we need,
so this hunk can be trivially reverted if it is an immediate problem.


On Thu, Jun 23, 2005 at 05:08:24PM +1000, Nick Piggin wrote:

@@ -527,13 +524,8 @@ static int saveable(struct zone * zone, return 0;

page = pfn_to_page(pfn);
- BUG_ON(PageReserved(page) && PageNosave(page));
if (PageNosave(page))
return 0;
- if (PageReserved(page) && pfn_is_nosave(pfn)) {
- pr_debug("[nosave pfn 0x%lx]", pfn);
- return 0;
- }
if (PageNosaveFree(page))
return 0;


The pfn_is_nosave() check must stand barring justification of why
unintentionally saving (and hence restoring) the page is tolerable.


I thought the PageNosave should catch that, and the next line is
just a debug check. But I'm looking for someone who *actually* knows
how swsusp works, if anyone would like to volunteer :)

Thanks very much for the review!

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com -
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/