Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()

From: HATAYAMA Daisuke
Date: Fri Jul 12 2013 - 12:03:37 EST

Next message: Daniel Drake: "Re: [PATCH V3 1/3] dts: change Marvell prefix to 'marvell'"
Previous message: Jason Cooper: "Re: [PATCH V3 1/3] dts: change Marvell prefix to 'marvell'"
In reply to: Michael Holzheu: "Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()"
Next in thread: Martin Schwidefsky: "Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

(2013/07/10 20:00), Michael Holzheu wrote:

On Wed, 10 Jul 2013 18:50:18 +0900
HATAYAMA Daisuke <d.hatayama@xxxxxxxxxxxxxx> wrote:

[snip]

(2013/07/10 17:42), Michael Holzheu wrote:
My suggestion is to add the WARN_ONCE() for #ifndef CONFIG_S390. This has the same
effect as your suggestion for all architectures besides of s390. And for s390 we
take the risk that a programming error would result in poor /proc/vmcore
performance.

If you want to avoid looking up vmcore_list that takes linear time w.r.t. the number
of the elements, you can still calculate the range of offsets in /proc/vmcore
corresponding to HSA during /proc/vmcore initialization.

Also, could you tell me how often and how much the HSA region is during crash dumping?
I guess the read to HSA is done mainly during early part of crash dumping process only.
According to the code, it appears at most 64MiB only. Then, I feel performance is not
a big issue.

Currently it is 32 MiB and normally it is read only once.

Also, cost of WARN_ONCE() is one memory access only in the 2nd and later calls. I don't
think it too much overhead...

I was more concerned about in_valid_fault_range(). But I was most concerned the additional
interface that introduces more complexity to the code. And that just to implement a
sanity check that in our opinion we don't really need.

And what makes it even worse:

What you think the sanity check is unnecessary is perfectly wrong. You design page faults
always happens on HSA region. If page fault happens on the other parts, i.e. some point
of mmap()ed region, it means somehow page table on the address has not been created. This
is bug, possibly caused by mmap() itself, page table creation, other components in kernel,
bit-flip due to broken hardware, etc. Anyway, program cannot detect what kind of bug occurs
now. There's no guarantee that program runs safely, of course for page cache creation, too.
We cannot and must expect such buggy process to behave in invalid states just as our design.
It results in undefined behaviour. The only thing we can do is to kill the buggy process
as soon as possible.

With the current patch series this check is only relevant for s390 :-)

So, at least for this patch series I would implement the fault handler as follows:

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
char *buf;
int rc;

#ifndef CONFIG_S390
WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
#endif
page = find_or_create_page(mapping, index, GFP_KERNEL);

At this point I have to tell you that we plan another vmcore patch series where
the fault handler might be called also for other architectures. But I think we
should *then* discuss your issue again.

Could you explain the plan in more detail? Or I cannot review correctly since I don't
know whether there's really usecase of this generic fault handler for other
architectures.
This is the issue for architectures other than s390, not mine; now we
don't need it at all.

I would have preferred to do the things one after the other. Otherwise I fear
that this discussion will never come to an end. This patch series is needed
to get zfcpdump running with /proc/vmcore. And for that reason the patch series
makes sense for itself.

But FYI:

The other patch series deals with the problem that we have additional
information for s390 that we want to include in /proc/vmcore. We have a one
byte storage key per memory page. This storage keys are stored in the
s390 firmware and can be read using a s390 specific machine instruction.
We plan to put that information into the ELF notes section. For a 1 TiB
dump we will have 256 MiB storage keys.

Currently the notes section is preallocated in vmcore.c. Because we do
not want to preallocate so much memory we would like to read the notes
section on demand. Similar to the HSA memory for zfcpdump. To get this
work with your mmap code, we would then also use the fault handler to
get the notes sections on demand. Our current patch does this for all
notes and therefore also the other architectures would then use the
fault handler.

One advantage for the common code is that no additional memory has
to be allocated for the notes buffer.

The storage key patch series is currently not final because it depends
on the zfcpdump patch series.

Yes, on-demand memory allocation by fault handler is definitely suitable
for note sections throughout all architectures. But need of sanity check
is unchanged here. The issue is orthogonal. It's still needed just as
explained above.

BTW, sanity check for note sections is easy.

if (!(elfnote->p_offset <= offset ||
offset < elfnote->p_offset + elfnote->p_filesz))
return VM_FAULT_SIGBUS;

I assume elfnote has a pointer to a unique PT_NOTE program header table
in the ELF header buffer. The code would be the same on every
architecture.

--
Thanks.
HATAYAMA, Daisuke

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Daniel Drake: "Re: [PATCH V3 1/3] dts: change Marvell prefix to 'marvell'"
Previous message: Jason Cooper: "Re: [PATCH V3 1/3] dts: change Marvell prefix to 'marvell'"
In reply to: Michael Holzheu: "Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()"
Next in thread: Martin Schwidefsky: "Re: [PATCH v6 3/5] vmcore: Introduce remap_oldmem_pfn_range()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]