Re: [PATCH] gpio: cdev: fix a crash on line-request release

From: Bartosz Golaszewski
Date: Wed May 24 2023 - 15:42:29 EST


On Wed, May 24, 2023 at 6:36 AM Kent Gibson <warthog618@xxxxxxxxx> wrote:
>
> On Wed, May 24, 2023 at 10:09:42AM +0800, Kent Gibson wrote:
> > On Wed, May 24, 2023 at 07:58:36AM +0800, Kent Gibson wrote:
> > > On Tue, May 23, 2023 at 05:51:01PM +0200, Bartosz Golaszewski wrote:
> > > > When a GPIO device is forcefully unregistered, we are left with an
> > > > inactive object. If user-space kept an open file descriptor to a line
> > > > request associated with such a structure, upon closing it, we'll see the
> > > > kernel crash due to freeing unexistent GPIO descriptors.
> > > >
> > >
> > > > @@ -1565,17 +1571,21 @@ static ssize_t linereq_read(struct file *file, char __user *buf,
> > > >
> > > > static void linereq_free(struct linereq *lr)
> > > > {
> > > > + struct gpio_device *gdev = lr->gdev;
> > > > unsigned int i;
> > > >
> > > > for (i = 0; i < lr->num_lines; i++) {
> > > > if (lr->lines[i].desc) {
> > > > edge_detector_stop(&lr->lines[i]);
> > > > - gpiod_free(lr->lines[i].desc);
> > > > + down_write(&gdev->sem);
> > > > + if (gdev->chip)
> > > > + gpiod_free(lr->lines[i].desc);
> > > > + up_write(&gdev->sem);
> >
> > Ummm, taking another look at the oops I sent you, the crash actually
> > occurs in edge_detector_stop():
> >
> > May 23 11:47:06 firefly kernel: [ 4216.877056] Call Trace:
> > May 23 11:47:06 firefly kernel: [ 4216.877512] <TASK>
> > May 23 11:47:06 firefly kernel: [ 4216.877924] irq_domain_deactivate_irq+0x19/0x30
> > May 23 11:47:06 firefly kernel: [ 4216.878543] free_irq+0x257/0x360
> > May 23 11:47:06 firefly kernel: [ 4216.879056] linereq_free+0x9b/0xe0
> > May 23 11:47:06 firefly kernel: [ 4216.879608] linereq_release+0xc/0x20
> > May 23 11:47:06 firefly kernel: [ 4216.880230] __fput+0x87/0x240
> > May 23 11:47:06 firefly kernel: [ 4216.880744] task_work_run+0x54/0x80
> >
> > That free_irq() call is in edge_detector_stop() (which apparently is inlined),
> > not in gpiod_free().
> >
> > So pretty sure this patch doesn't even solve my problem, but I will test
> > it to confirm.
> >
>
> Yeah, doesn't fix my problem still crashes.
>
> If the line request doesn't have edge detection enabled (so no
> irq) then I don't get a crash.
> i.e. use gpioset to request the line, rather than gpiomon.
>
> (To provide background for anyone else trying to follow along, the
> scenario is:
> 1. create a gpio-sim
> 2. request a line
> 3. destroy the gpio-sim
> 4. release the line.
>
> 3 triggers this error:
>
> May 24 11:11:12 firefly kernel: [ 200.027280] gpio_stub_drv gpiochip0: REMOVING GPIOCHIP WITH GPIOS STILL REQUESTED
>
> and 4 triggers a crash - if the requested line holds an irq.)
>
> I would point out:
> /**
> * gpiochip_remove() - unregister a gpio_chip
> * @gc: the chip to unregister
> *
> * A gpio_chip with any GPIOs still requested may not be removed.
> */
> void gpiochip_remove(struct gpio_chip *gc)
>
> which is where that dev_crit() is, so destroying the gpio-sim has already
> invalidated that contract.
>
> Anyway, it seems my problem is the forced removal of the gpiochip invalidates
> the irq that the line request is holding.
> Not sure how best to deal with that.
>
> Moving the edge_detector_stop() inside the "if (gdev->chip)" check of
> your patch does prevent crash.
> But in that case edge_detector_stop() does other cleanup that is no longer
> getting done.
> Perhaps if the chip is gone then zero line->irq prior to calling
> edge_detector_stop()?
> Again, this is starting to feel like a hack for gpiolib not being good
> at telling the client that it has to pull the rug.
> Though according the the gpiochip_remove() docs, it WONT pull the rug,
> so you get that.
>
> Cheers,
> Kent.

Interestingly enough, I did test it just like you and the "fix" seemed
to address the issue. Upon a further look at the code, it's of course
clear that the patch is wrong.

I wanted to debug the code to see what's happening exactly but it
turned out that enabling the generation of DWARF data hid the issue as
well even without any fix. It means that it's some kind of a memory
corruption rather than a regular NULL-pointer dereference.

I'm not yet sure where the crash happens exactly other that it's in
the irq domain code.

Anyway, I'll be back at it tomorrow.

Bart