Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

From: Michael Holzheu
Date: Mon Jul 08 2013 - 05:28:56 EST


On Mon, 08 Jul 2013 14:32:09 +0900
HATAYAMA Daisuke <d.hatayama@xxxxxxxxxxxxxx> wrote:

> (2013/07/02 4:32), Michael Holzheu wrote:
> > For zfcpdump we can't map the HSA storage because it is only available
> > via a read interface. Therefore, for the new vmcore mmap feature we had
> > to introduce a new mechanism to create mappings on demand.
> >
> > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > that should be used to create mappings with remap_pfn_range() for oldmem
> > areas that can be directly mapped. For zfcpdump this is everything besides
> > the HSA memory. For the areas that are not mapped by
> > remap_oldmem_pfn_range(), a new generic vmcore fault handler
> > mmap_vmcore_fault() is called.
> >
>
> This fault handler addresses an s390-specific issue. Other architectures
> don't need it for the time being.
>
> Also, for the same reason, I'm doing this review based on the source code only.
> I cannot run the fault handler on a meaningful system, which is currently s390 only.

You can test the code on other architectures if you do not map anything in advance.
For example, you could just "return 0" in remap_oldmem_pfn_range():

/*
 * Architectures may override this function to map oldmem
 */
int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
				  unsigned long from, unsigned long pfn,
				  unsigned long size, pgprot_t prot)
{
	return 0;
}

In that case the new mechanism would be used for all pages.
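
To drive the fault path from user space, a minimal test program (just a
sketch, not part of the patch) could mmap() /proc/vmcore and touch a page,
so that every access is forced through the new fault handler:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/vmcore", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Nothing is premapped, so this mapping is initially empty. */
	char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* The first access faults and runs mmap_vmcore_fault(). */
	printf("first byte: %02x\n", (unsigned char)p[0]);
	munmap(p, 4096);
	close(fd);
	return 0;
}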

>
> I'm also concerned that the fault handler covers the full range of vmcore, which
> could hide some kind of mmap() bug that results in a page fault.
>
> So, the fault handler should be enclosed in #ifdef CONFIG_S390 for the time being.

I personally do not like that, but if Vivek and you prefer this, of course we
can do that.

What about something like:

#ifdef CONFIG_S390
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	...
}
#else
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	BUG();
}
#endif

>
> > This handler works as follows:
> >
> > * Get already available or new page from page cache (find_or_create_page)
> > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > * If yes:
> >     Return that page
> > * If no:
> >     Fill page using __vmcore_read(), set PageUptodate, and return page
> >
>
> This page-in logic seems good to me.
>
> <cut>
> > @@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
> >  	return acc;
> >  }
> >
> > +static ssize_t read_vmcore(struct file *file, char __user *buffer,
> > +			   size_t buflen, loff_t *fpos)
> > +{
> > +	return __read_vmcore(buffer, buflen, fpos, 1);
> > +}
> > +
> > +/*
> > + * The vmcore fault handler uses the page cache and fills data using the
> > + * standard __vmcore_read() function.
> > + */
>
> Could you describe the use case of this fault handler on s390?

ok.

>
> > +static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > +	pgoff_t index = vmf->pgoff;
> > +	struct page *page;
> > +	loff_t src;
> > +	char *buf;
> > +	int rc;
> > +
>
> You should check whether the faulting address points to a valid range.
> If the fault happens on an invalid range, return VM_FAULT_SIGBUS.
>
> In the s390 case, I think every range except for the HSA should be thought of as invalid.
>

Hmmm, this would require another architecture backend interface. If we get a fault
for a page outside of the HSA, this would be a programming error in our code. I am
not sure if we should introduce new architecture interfaces just for sanity checks.
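
That said, if a sanity check is wanted after all, a generic one against the
total dump size would not need a new backend interface. A sketch only, using
the vmcore_size variable that fs/proc/vmcore.c already maintains, at the top
of the fault handler:

	/*
	 * Sketch: reject faults beyond the end of the dump instead of
	 * trying to page them in.
	 */
	if ((loff_t)vmf->pgoff << PAGE_CACHE_SHIFT >= vmcore_size)
		return VM_FAULT_SIGBUS;

This would not catch a stray fault inside a directly mappable region on s390,
but it bounds the damage of an out-of-range access.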

> > +	page = find_or_create_page(mapping, index, GFP_KERNEL);
> > +	if (!page)
> > +		return VM_FAULT_OOM;
> > +	if (!PageUptodate(page)) {
> > +		src = index << PAGE_CACHE_SHIFT;
>
> src = (loff_t)index << PAGE_CACHE_SHIFT;
>
> loff_t is long long while index is unsigned long.

Ok.

> On s390 both might have the same byte length.

On s390x both are 64 bit. On our 32-bit s390 architecture, long long is 64 bit
and unsigned long is 32 bit.
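
To illustrate the 32-bit case with example values (PAGE_CACHE_SHIFT is 12
here, and unsigned int stands in for a 32-bit unsigned long so the snippet
behaves the same on any machine):

#include <stdio.h>

int main(void)
{
	unsigned int index = 0x100000;	/* page index for offset 4 GiB */

	/* Shift evaluated in 32 bits, upper bits lost, then widened: 0 */
	long long wrong = index << 12;

	/* Widen first, then shift in 64 bits: 0x100000000 as intended */
	long long right = (long long)index << 12;

	printf("wrong=%llx right=%llx\n", wrong, right);
	return 0;
}

So the cast matters for any dump offset at or above 4 GiB on 32-bit builds.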

> Also, I prefer offset to src, but this is a minor suggestion.

Yes, I agree.

>
> > +	buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
>
> I found page_to_virt() macro.

Looks like page_to_virt() is not usable on most architectures, and probably
pfn_to_kaddr(pfn) would be the correct thing here. Unfortunately it is also
not defined on s390.

But when I discussed your comment with Martin, we found out that the current

buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);

is not correct on x86. It should be:

buf = __va((page_to_pfn(page) << PAGE_SHIFT));
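
The underlying distinction: the shifted pfn is a physical address, while the
buffer passed to __read_vmcore() must be a kernel virtual address. A short
sketch of what the two expressions compute:

	/* Physical address of the page frame backing "page" ... */
	unsigned long paddr = page_to_pfn(page) << PAGE_SHIFT;

	/* ... and its kernel virtual address in the linear mapping.
	 * On x86 the two differ by PAGE_OFFSET, so dereferencing the
	 * physical value directly would be wrong; on s390 they happen
	 * to coincide, which is presumably why it went unnoticed there. */
	char *buf = __va(paddr);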

So would the following patch be ok for you?
---
 fs/proc/vmcore.c           | 94 +++++++++++++++++++++++++++++++++++++++++----
 include/linux/crash_dump.h |  3 +
 2 files changed, 89 insertions(+), 8 deletions(-)

--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -21,6 +21,7 @@
 #include <linux/crash_dump.h>
 #include <linux/list.h>
 #include <linux/vmalloc.h>
+#include <linux/pagemap.h>
 #include <asm/uaccess.h>
 #include <asm/io.h>
 #include "internal.h"
@@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(cha
 	return read_from_oldmem(buf, count, ppos, 0);
 }
 
+/*
+ * Architectures may override this function to map oldmem
+ */
+int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+				  unsigned long from, unsigned long pfn,
+				  unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
+ * Copy to either kernel or user space
+ */
+static int copy_to(void *target, void *src, size_t size, int userbuf)
+{
+	if (userbuf) {
+		if (copy_to_user(target, src, size))
+			return -EFAULT;
+	} else {
+		memcpy(target, src, size);
+	}
+	return 0;
+}
+
 /* Read from the ELF header and then the crash dump. On error, negative value is
  * returned otherwise number of bytes read are returned.
  */
-static ssize_t read_vmcore(struct file *file, char __user *buffer,
-				size_t buflen, loff_t *fpos)
+static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
+			     int userbuf)
 {
 	ssize_t acc = 0, tmp;
 	size_t tsz;
@@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *
 	/* Read ELF core header */
 	if (*fpos < elfcorebuf_sz) {
 		tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
-		if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
+		if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
 			return -EFAULT;
 		buflen -= tsz;
 		*fpos += tsz;
@@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *
 
 		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
 		kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
-		if (copy_to_user(buffer, kaddr, tsz))
+		if (copy_to(buffer, kaddr, tsz, userbuf))
 			return -EFAULT;
 		buflen -= tsz;
 		*fpos += tsz;
@@ -208,7 +233,7 @@ static ssize_t read_vmcore(struct file *
 		if (*fpos < m->offset + m->size) {
 			tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
 			start = m->paddr + *fpos - m->offset;
-			tmp = read_from_oldmem(buffer, tsz, &start, 1);
+			tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
 			if (tmp < 0)
 				return tmp;
 			buflen -= tsz;
@@ -225,6 +250,58 @@ static ssize_t read_vmcore(struct file *
 	return acc;
 }
 
+static ssize_t read_vmcore(struct file *file, char __user *buffer,
+			   size_t buflen, loff_t *fpos)
+{
+	return __read_vmcore(buffer, buflen, fpos, 1);
+}
+
+#ifdef CONFIG_S390
+/*
+ * The vmcore fault handler uses the page cache and fills data using the
+ * standard __vmcore_read() function.
+ *
+ * On s390 the fault handler is used for memory regions that can't be mapped
+ * directly with remap_pfn_range().
+ */
+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	pgoff_t index = vmf->pgoff;
+	struct page *page;
+	loff_t offset;
+	char *buf;
+	int rc;
+
+	page = find_or_create_page(mapping, index, GFP_KERNEL);
+	if (!page)
+		return VM_FAULT_OOM;
+	if (!PageUptodate(page)) {
+		offset = (loff_t) index << PAGE_CACHE_SHIFT;
+		buf = __va((page_to_pfn(page) << PAGE_SHIFT));
+		rc = __read_vmcore(buf, PAGE_SIZE, &offset, 0);
+		if (rc < 0) {
+			unlock_page(page);
+			page_cache_release(page);
+			return (rc == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
+		}
+		SetPageUptodate(page);
+	}
+	unlock_page(page);
+	vmf->page = page;
+	return 0;
+}
+#else
+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	BUG();
+}
+#endif
+
+static const struct vm_operations_struct vmcore_mmap_ops = {
+	.fault = mmap_vmcore_fault,
+};
+
 static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 {
 	size_t size = vma->vm_end - vma->vm_start;
@@ -242,6 +319,7 @@ static int mmap_vmcore(struct file
 
 	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
 	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &vmcore_mmap_ops;
 
 	len = 0;
 
@@ -283,9 +361,9 @@ static int mmap_vmcore(struct file
 
 		tsz = min_t(size_t, m->offset + m->size - start, size);
 		paddr = m->paddr + start - m->offset;
-		if (remap_pfn_range(vma, vma->vm_start + len,
-				    paddr >> PAGE_SHIFT, tsz,
-				    vma->vm_page_prot))
+		if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
+					   paddr >> PAGE_SHIFT, tsz,
+					   vma->vm_page_prot))
 			goto fail;
 		size -= tsz;
 		start += tsz;
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -17,6 +17,9 @@ extern int __weak elfcorehdr_alloc(unsig
 extern void __weak elfcorehdr_free(unsigned long long addr);
 extern ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos);
 extern ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos);
+extern int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+					 unsigned long from, unsigned long pfn,
+					 unsigned long size, pgprot_t prot);
 
 extern ssize_t copy_oldmem_page(unsigned long, char *, size_t,
 				unsigned long, int);
 
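For reference, the s390 override that hooks into this would look roughly like
the following sketch (oldmem_range_is_hsa() is a made-up name for whatever
check the zfcpdump backend uses to identify HSA-backed ranges):

/*
 * Hypothetical sketch only: real code must also handle ranges that
 * straddle the HSA boundary. Returning 0 without mapping leaves the
 * range to the generic fault handler, which then fills the pages on
 * demand via __read_vmcore().
 */
int remap_oldmem_pfn_range(struct vm_area_struct *vma, unsigned long from,
			   unsigned long pfn, unsigned long size, pgprot_t prot)
{
	if (oldmem_range_is_hsa(pfn, size))
		return 0;
	return remap_pfn_range(vma, from, pfn, size, prot);
}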