Re: [PATCH] /dev/mem: Bail out upon SIGKILL when reading memory.

From: Ingo Molnar
Date: Sat Aug 24 2019 - 12:14:40 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, Aug 23, 2019 at 2:16 AM Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> >
> > > + */
> > > + if (fatal_signal_pending(current)) {
> > > + if (!(error_code & X86_PF_USER))
> > > + no_context(regs, error_code, address, 0, 0);
> >
> > Since the 'signal' parameter to no_context() is 0, will there be another
> > signal generated? I don't think so - so the comment looks wrong to me,
> > unless I'm missing something.
>
> The old case only handled the kernel case - which never added a signal
> at all, it just failed the access (and results in EFAULT or similar,
> but nobody cares since the whole point is that the process is going to
> be killed).
>
> The *changed* case handles user space accesses too - by just returning
> without any new signal being generated. The old case would fall
> through to the "generate SIGSEGV", which seems pointless and wrong
> (and also possibly generates misleading messages in the kernel logs).

Indeed!

> > Other than that, what we are skipping here if a KILL signal is pending is
> > the printout of oops information if the fault is a kernel one.
> >
> > Not sure that's a good idea in general: carefully timing a KILL signal
> > would allow the 'stealth probing' of otherwise oops generating addresses?
>
> That sounds really like a non-issue to me. Plus the old code ALREADY
> did that exact thing. The only _new_ case is that it does is silently
> for user page faults.
>
> > I.e. I'm not sure this hunk is necessary or even a good idea.
>
> As mentioned, the new code doesn't change the case you are talking about at all.
>
> The new code only changes the case of this happening from user space,
> when it doesn't generate a pointless signal and a pointless possible
> show_signal_msg() garbage for a user mode access that was denied due
> to the new
>
> > > + if (flags & FAULT_FLAG_KILLABLE) {
> > > + if (fatal_signal_pending(current))
> > > + return VM_FAULT_SIGSEGV;
> > > + }
>
> code in handle_mm_fault().
>
> And you said that new code looks fine to you, but you didn't seem to
> realize that it causes stupid incorrect kernel messages to be printed
> ("segfault at xyz" etc) that are completely wrong.
>
> Because it's not a "real" SIGSEGV. It gets repressed by the fact that
> there's a fatal signal pending.
>
> Otherwise we'd have to add a whole new case of VM_FAULT_xyz..

Indeed :-/

At least I can report that I was able to reproduce a probable variant of
the bug Tetsuo reported on a regular PC as well, with this command:

dd if=/dev/mem of=/dev/null status=progress conv=noerror bs=100M

The 'noerror' is needed to skip unreadable holes early in the memory map.

What I noticed is that while reading regular RAM is reasonably fast even
in very large chunks, accesses become very slow once they hit iomem - and
delays even longer than the reported 145 seconds become possible if
bs=100M is increased to say bs=1000M.

With your patch applied these large read chunks are still possible and
not interruptible within read_mem(), and because the source of the
slowdown is the iomem access and memcpy()s, not the paging overhead, the
new page fault code doesn't even trigger and the delays remain.

I.e. this hypothesis is not true, on my testbox at least - and with my
different testcase:

> > > Also, if it takes minutes to delay killing things, that implies
> > > that we're probably still faulting in pages for the read_mem().
> > > Which points to another possible thing we could do in general: just
> > > don't bother to handle page faults when a fatal signal is pending.

With Tetsuo's approach the delays are fixed, because the fatal signal is
noticed within the 4K chunks of read_mem() immediately:

[pid 1674] 1566662105.052485 write(1, "ZZZZ\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 104857600) = 104857600 <0.000008>
[pid 1674] 1566662105.052512 read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 104857600) = 104857600 <17.006740>
[pid 1674] 1566662122.059306 write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 104857600) = 104857600 <0.000009>
) = 1 <0.000010>662122.059334 write(2, "\r", 1
[pid 1674] 1566662122.059372 write(2, "3145728000 bytes (3.1 GB) copied", 323145728000 bytes (3.1 GB) copied) = 32 <0.000009>
[pid 1674] 1566662122.059411 write(2, ", 18.214699 s, 173 MB/s", 23, 18.214699 s, 173 MB/s) = 23 <0.000008>
[pid 1674] 1566662122.059433 read(0, <unfinished ...>
[pid 1674] 1566662123.082491 +++ killed by SIGKILL +++

Note how the first 100MB chunk took 17 seconds, the second chunk was
interrupted within ~2 seconds after I sent a SIGKILL.

Find below a variant of Tetsuo's patch, that uses a more regular signal
return pattern that you suggested in your other mail:

> (Ok, in this case I think it wants
>
> err = -EINTR;
> if (fatal_signal_pending(current))
> break;
>
> instead, but the point is to make it look like signal handling, just
> with the special "fatal signals can sometimes be handled even when
> regular signals might not make it through".

Except that I think the regular pattern here is not 'break' but 'goto
failed' - 'break' would just result in a partial read and 'err' gets
ignored.

I also agree that we should probably allow regular interrupts to
interrupt /dev/mem accesses - while the 'dd' process is killable, it's
not Ctrl-C-able.

If I change the fatal_signal_pending() to signal_pending() it all behaves
much better IMHO - assuming that no utility learned rely on
non-interruptibility ...

The second patch is attached below the first one.

Independently of this I also agree with Tetsuo that a cond_resched()
might be in order to break up this loop on non-preempt kernels. The third
patch below implements this variant.

Is this closer to what you had in mind?

Thanks,

Ingo

---
drivers/char/mem.c | 9 +++++++++
1 file changed, 9 insertions(+)

Index: linux/drivers/char/mem.c
===================================================================
--- linux.orig/drivers/char/mem.c
+++ linux/drivers/char/mem.c
@@ -175,6 +175,15 @@ static ssize_t read_mem(struct file *fil
p += sz;
count -= sz;
read += sz;
+
+ /*
+ * Reading from iomem areas of /dev/mem can be slow,
+ * depending on the hardware, so allow such read()s
+ * to be interrupted via fatal signals (SIGKILL):
+ */
+ err = -EINTR;
+ if (fatal_signal_pending(current))
+ goto failed;
}
kfree(bounce);


drivers/char/mem.c | 9 +++++++++
1 file changed, 9 insertions(+)

Index: linux/drivers/char/mem.c
===================================================================
--- linux.orig/drivers/char/mem.c
+++ linux/drivers/char/mem.c
@@ -175,6 +175,15 @@ static ssize_t read_mem(struct file *fil
p += sz;
count -= sz;
read += sz;
+
+ /*
+ * Reading from iomem areas of /dev/mem can be slow,
+ * depending on the hardware, so allow such read()s
+ * to be interruptible:
+ */
+ err = -EINTR;
+ if (signal_pending(current))
+ goto failed;
}
kfree(bounce);

Index: linux/drivers/char/mem.c
===================================================================
--- linux.orig/drivers/char/mem.c
+++ linux/drivers/char/mem.c
@@ -175,6 +175,18 @@ static ssize_t read_mem(struct file *fil
p += sz;
count -= sz;
read += sz;
+
+ /*
+ * Reading from iomem areas of /dev/mem can be slow,
+ * depending on the hardware, so allow such read()s
+ * to be interruptible and preemptible:
+ */
+ err = -EINTR;
+ if (signal_pending(current))
+ goto failed;
+
+ if (need_resched())
+ cond_resched();
}
kfree(bounce);