On Fri, Jul 12, 2024 at 02:50:56PM -0700, Roman Kisel wrote:Thanks, will update the code to be consistent with the existing logging.
Missing, failed, or corrupted core dumps might impede crash
investigations. To improve reliability of that process and consequently
the programs themselves, one needs to trace the path from producing
a core dumpfile to analyzing it. That path starts from the core dump file
written to the disk by the kernel or to the standard input of a user
mode helper program to which the kernel streams the coredump contents.
There are cases where the kernel will interrupt writing the core out or
produce a truncated/not-well-formed core dump without leaving a note.
Add logging for the core dump collection failure paths to be able to reason
what has gone wrong when the core dump is malformed or missing.
Signed-off-by: Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx>
---
fs/binfmt_elf.c | 60 ++++++++++++++++-----
fs/coredump.c | 109 ++++++++++++++++++++++++++++++++-------
include/linux/coredump.h | 8 ++-
kernel/signal.c | 22 +++++++-
4 files changed, 165 insertions(+), 34 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index a43897b03ce9..cfe84b9436af 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1994,8 +1994,11 @@ static int elf_core_dump(struct coredump_params *cprm)
* Collect all the non-memory information about the process for the
* notes. This also sets up the file header.
*/
- if (!fill_note_info(&elf, e_phnum, &info, cprm))
+ if (!fill_note_info(&elf, e_phnum, &info, cprm)) {
+ pr_err_ratelimited("Error collecting note info, core dump of %s(PID %d) failed\n",
+ current->comm, current->pid);
A couple things come to mind for me as I scanned through this:
- Do we want to report pid or tgid?
- Do we want to report the global value or the current pid namespace
mapping?
Because I notice that the existing code:
printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
task_tgid_vnr(current), current->comm);
Is reporting tgid for current's pid namespace. We should be consistent.
I think all of this might need cleaning up first before adding new100% agreed.
reports. We should consolidate the reporting into a single function so
this is easier to extend in the future. Right now the proposed patch is
hand-building the report, and putting pid/comm in different places (at
the end, at the beginning, with/without "of", etc), which is really just
boilerplate repetition.
Absolutely love that! Couldn't possibly appreciate your help more :)
How about creating:
static void coredump_report_failure(const char *msg)
{
char comm[TASK_COMM_LEN];
task_get_comm(current, comm);
pr_warn_ratelimited("coredump: %d(%*pE): %s\n",
task_tgid_vnr(current), strlen(comm), comm, msg);
}
Then in a new first patch, convert all the existing stuff:
printk(KERN_WARNING ...)
pr_info(...)
etc
Since even the existing warnings are inconsistent and don't escape
newlines, etc. :)
Then in patch 2 use this to add the new warnings?