Re: [PATCH] devcoredump: increase the device delete timeout to 10 mins

From: Abhinav Kumar
Date: Tue Feb 08 2022 - 17:21:55 EST


Hi Johannes

Thanks for the response.

On 2/8/2022 12:35 PM, Johannes Berg wrote:
> On Tue, 2022-02-08 at 11:44 -0800, Abhinav Kumar wrote:
> > There are cases where depending on the size of the devcoredump and the speed
> > at which the usermode reads the dump, it can take longer than the current 5 mins
> > timeout.
> >
> > This can lead to incomplete dumps as the device is deleted once the timeout expires.
> >
> > One example is below where it took 6 mins for the devcoredump to be completely read.
> >
> > 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
> > 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
> >
> > Increase the timeout to 10 mins to accommodate system delays and large coredump
> > sizes.
> >
>
> No real objection, I guess, but can the data actually disappear *while*
> the sysfs file is open?!
>
> Or did it take 5 minutes to open the file?
>
> If the former, maybe we should fix that too (or instead)?
>
> johannes

It opened the file right away but could not finish reading.

The device gets deleted, so the corresponding /data will disappear too (as the data node is under devcd*/data).

static void devcd_del(struct work_struct *wk)
{
	struct devcd_entry *devcd;

	devcd = container_of(wk, struct devcd_entry, del_wk.work);

	device_del(&devcd->devcd_dev);
	put_device(&devcd->devcd_dev);
}

Are you suggesting we implement logic like this:

a) If usermode has started reading the data but has not finished yet (we can detect the former with something like devcd->data_read_ongoing = 1, and we know it has finished when it acks, at which point we can clear the flag), then in the del_wk timeout handler we delay the delete timer by another TIMEOUT to give usermode time to finish reading the data?

b) If usermode acks, we clear the flag and delete the device as usual.

But there is a corner case here:

c) If usermode starts the read but then crashes for some reason, the timer will time out and try to delete the device, but it will see that usermode is still reading and will keep the device. How do we detect this case?

That's why I thought maybe the easier way right now is to try increasing the timeout.
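
For reference, here is a rough userspace model of (a) to (c), only to make the decision logic concrete. It is not kernel code and not a proposed patch: the struct, the function names and the DEVCD_MAX_EXTENSIONS cap are all hypothetical. The cap is just one possible way to bound case (c), since a crashed reader would otherwise leave data_read_ongoing set forever.

```c
#include <stdbool.h>

/* Userspace model only; all names below are hypothetical. */
enum devcd_action {
	DEVCD_DELETE,	/* delete the device now */
	DEVCD_REARM,	/* re-arm the delete timer for another TIMEOUT */
};

struct devcd_model {
	bool data_read_ongoing;		/* set on first read, cleared on ack */
	unsigned int extensions;	/* how often the timer was re-armed */
};

/* Assumed cap so that a crashed reader, case (c), cannot pin the device. */
#define DEVCD_MAX_EXTENSIONS 1U

/* Usermode has started reading the data node. */
void devcd_model_read_start(struct devcd_model *d)
{
	d->data_read_ongoing = true;
}

/* (b): usermode acks, so clear the flag and delete as usual. */
enum devcd_action devcd_model_ack(struct devcd_model *d)
{
	d->data_read_ongoing = false;
	return DEVCD_DELETE;
}

/*
 * Timer expiry, i.e. what the del_wk handler would decide.
 * (a): a read is in flight, so grant another TIMEOUT.
 * (c): the reader never acked and the cap is reached, so delete anyway.
 */
enum devcd_action devcd_model_timeout(struct devcd_model *d)
{
	if (d->data_read_ongoing && d->extensions < DEVCD_MAX_EXTENSIONS) {
		d->extensions++;
		return DEVCD_REARM;
	}
	return DEVCD_DELETE;
}
```

In the driver, the DEVCD_REARM branch would presumably re-queue del_wk instead of calling device_del(). With a cap of 1, the worst case is one extra TIMEOUT, which is no worse than simply doubling the timeout, so I still think the plain timeout increase is the simpler change for now.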