Re: [PATCH v2 1/1] binfmt_elf, coredump: Log the reason of the failed core dumps

From: Roman Kisel
Date: Sun Jul 14 2024 - 10:34:07 EST




On 7/13/2024 9:31 AM, Kees Cook wrote:
On Fri, Jul 12, 2024 at 02:50:56PM -0700, Roman Kisel wrote:
Missing, failed, or corrupted core dumps might impede crash
investigations. To improve reliability of that process and consequently
the programs themselves, one needs to trace the path from producing
a core dumpfile to analyzing it. That path starts from the core dump file
written to the disk by the kernel or to the standard input of a user
mode helper program to which the kernel streams the coredump contents.
There are cases where the kernel will interrupt writing the core out or
produce a truncated/not-well-formed core dump without leaving a note.

Add logging for the core dump collection failure paths to be able to reason
what has gone wrong when the core dump is malformed or missing.

Signed-off-by: Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx>
---
fs/binfmt_elf.c | 60 ++++++++++++++++-----
fs/coredump.c | 109 ++++++++++++++++++++++++++++++++-------
include/linux/coredump.h | 8 ++-
kernel/signal.c | 22 +++++++-
4 files changed, 165 insertions(+), 34 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index a43897b03ce9..cfe84b9436af 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1994,8 +1994,11 @@ static int elf_core_dump(struct coredump_params *cprm)
* Collect all the non-memory information about the process for the
* notes. This also sets up the file header.
*/
- if (!fill_note_info(&elf, e_phnum, &info, cprm))
+ if (!fill_note_info(&elf, e_phnum, &info, cprm)) {
+ pr_err_ratelimited("Error collecting note info, core dump of %s(PID %d) failed\n",
+ current->comm, current->pid);

A couple things come to mind for me as I scanned through this:

- Do we want to report pid or tgid?
- Do we want to report the global value or the current pid namespace
mapping?

Because I notice that the existing code:

printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
task_tgid_vnr(current), current->comm);

Is reporting tgid for current's pid namespace. We should be consistent.

Thanks, will update the code to be consistent with the existing logging.

I think all of this might need cleaning up first before adding new
reports. We should consolidate the reporting into a single function so
this is easier to extend in the future. Right now the proposed patch is
hand-building the report, and putting pid/comm in different places (at
the end, at the beginning, with/without "of", etc), which is really just
boilerplate repetition.
100% agreed.


How about creating:

static void coredump_report_failure(const char *msg)
{
char comm[TASK_COMM_LEN];

task_get_comm(current, comm);

pr_warn_ratelimited("coredump: %d(%*pE): %s\n",
task_tgid_vnr(current), strlen(comm), comm, msg);
}

Then in a new first patch, convert all the existing stuff:

printk(KERN_WARNING ...)
pr_info(...)
etc

Since even the existing warnings are inconsistent and don't escape
newlines, etc. :)

Then in patch 2 use this to add the new warnings?
Absolutely love that! Couldn't possibly appreciate your help more :)



--
Thank you,
Roman