Re: /proc/$pid/pagemap troubles

From: Matt Mackall
Date: Tue Jul 31 2007 - 20:13:57 EST


On Tue, Jul 31, 2007 at 03:43:10PM -0700, Dave Hansen wrote:
> On Tue, 2007-07-31 at 16:37 -0500, Matt Mackall wrote:
> >
> > > So, a couple of questions. Don't we need to support
> > non-sizeof(unsigned
> > > long)-aligned reads?
> >
> > Why? We should obviously never return more data than we were asked for
> > (that's clearly a bug), but lots of things refuse to read or write
> > stuff that isn't well sized and aligned.
>
> For the 1-byte stuff at the beginning of the file, I think it makes
> perfect sense:
>
> char __big_endian;
> int big_endian;
> read(fd, &__big_endian, 1);
> big_endian = __big_endian;

True.

> > > Do we _really_ need that header in each and every file?
> >
> > Well there's either a header or there isn't.
>
> I just mean that perhaps we should be putting some of that stuff in a
> global file. I think there's some added simplicity both in the kernel
> and in the userspace programs if we just let "*ppos << PAGE_SHIFT ==
> vaddr".

Absolutely. It's the source of both of the long read bugs you
mentioned. But there are also advantages to having it attached.

> I was going to send you a patch, but since you're updating anyway, could
> you add some symbolic names like this:
>
> #define PAGEMAP_ENTRY_SIZE_BYTES sizeof(unsigned long)
> #define PAGEMAP_HEADER_SIZE_BYTES sizeof(unsigned long)
> #define PAGEMAP_MAX_VIRT_PFN (((~0UL) >> PAGE_SHIFT) + 1)
> #define PAGEMAP_BUFFER_SLOTS (PAGE_SIZE/PAGEMAP_ENTRY_SIZE_BYTES)
> #define PAGEMAP_UNPOPULATED_PAGE (-1UL)

I suppose.

> + evpfn = min((src + count) / sizeof(unsigned long),
> + ((~0UL) >> PAGE_SHIFT) + 1);
>
> Should that hunk of code be any different for 32-bit processes on a
> 64-bit kernel? Do we want to use TASK_SIZE_OF() or restrict these to
> actual virtual address space (which is only 48 bits on x86_64).

No. A page for a 32-bit task can line anywhere in the >32-bit physical
address space.

> It also seems that we need some replacement for the CONFIG_HIGHPTE
> #ifdefs. What is the basic reason for those? The copy_to_user() called
> by add_to_pagemap()->flush_pagemap() conflicts with the kmap_atomic()
> from pte_offset_map()?

My not-yet-working code uses get_user_pages and does away with all the
double-buffering and a bunch of the locking issues.

> As I think about it, do we really even need to walk the VMA list at all?
> We don't do anything differently if an address is inside a VMA. We just
> save time by not walking the pagetables for areas that we know are
> zeroed. But, we won't be able to walk down to the PTE level in most
> cases, anyway. So, it probably won't waste _that_ much time.

Perhaps not.

> +struct pagemapread {
> + struct mm_struct *mm;
> + unsigned long next;
> + unsigned long *buf;
> + pte_t *ptebuf;
> + unsigned long pos;
> + size_t count;
> + int index;
> + char __user *out;
> +};
>
> Just looking at the structure, it's _really_ hard to tell what fields go
> with other fields, and it took me a while to unravel it all. There's
> also some redundancy in it, which I find a bit confusing.

Most of those go away with the get_user_pages approach.

--
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/