Re: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

From: Souza, Jose
Date: Wed Feb 28 2024 - 12:56:17 EST


On Wed, 2024-02-28 at 18:05 +0100, Johannes Berg wrote:
> > Current 5-minute timeout may be too short for users to search and
> > understand what needs to be done to capture coredump to report bugs.
>
> Conceptually, I'm not sure I understand this. Users should probably have
> a script to capture coredumps to a file in the filesystem, possibly with
> additional data such as 'dmesg' at the time of the dump.
>
> Having this stick around longer in core kernel memory (not even
> swappable) seems like a bad idea?
>
> What kind of timeout were you thinking? Maybe you'd want 10 minutes? An
> hour?
>
> Also, then, why should the timeout be device-specific? If the user is
> going to need time to find stuff, then surely that applies regardless of
> the device?
>
> So ... I guess I don't really like this, and don't really see how it
> makes sense. Arguably, 5 minutes even is too long, not too short,
> because you should have scripting that captures it, writes it to disk,
> and all that can happen in the space of seconds, rather than minutes.
> It's trivial to write such a script with a udev trigger or similar.
>
> If we wanted to, we could even have a script that not only captures it
> to disk, but also deletes it again from disk after a day or something,
> so if you didn't care you don't get things accumulating. But I don't see
> why the kernel should need to hang on to all the (possibly big) core
> dump in RAM, for whatever time. And I also don't like the device-
> dependency very much, TBH.

In my opinion, the timeout should depend on the type of device driver.

In the case of server-class Ethernet cards, where corporate users automate most tasks, five minutes might even be considered excessive.

For our case, GPUs, users might experience minor glitches and only search for what happened after finishing their current task (writing an email,
ending a gaming match, watching a YouTube video, etc.).
If they land on https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html or the future Xe version of that page, following the
instructions alone may take inexperienced Linux users more than five minutes.

I have set the timeout to one hour in the Xe driver, but this could increase if we start receiving user complaints.

>
>
>
> But if we do go there eventually:
>
> > +void dev_coredumpm(struct device *dev, struct module *owner,
> > + void *data, size_t datalen, gfp_t gfp,
> > + ssize_t (*read)(char *buffer, loff_t offset, size_t count,
> > + void *data, size_t datalen),
> > + void (*free)(void *data))
> > +{
> > + dev_coredumpm_timeout(dev, owner, data, datalen, gfp, read, free,
> > + DEVCD_TIMEOUT);
> > +}
> >  EXPORT_SYMBOL_GPL(dev_coredumpm);
>
> This could be a trivial static inline now, if you just put DEVCD_TIMEOUT
> into the header file. Seems better than exporting another whole function
> for it. Then you also don't need the no-op version of it.

works for me too, will wait an ack in the above explanation.

>
> johannes
>