Re: [PATCH] pciehp: Fix race condition handling surprise link-down
From: Bjorn Helgaas
Date: Fri Feb 03 2017 - 11:51:20 EST

On Thu, Feb 02, 2017 at 10:00:53PM -0800, Raj, Ashok wrote:
> Hi Bjorn
>
> On Thu, Feb 02, 2017 at 08:59:01PM -0600, Bjorn Helgaas wrote:
> > Hi Ashok,
> >
> > Sorry it took me so long to review this. I never felt like I really
> > understood it, and it took me a long time to try to figure out a more
> > useful response.
>
> No worries. Agreed, it's a little tricky, and it took me several iterations
> before doing something that was simple enough, without a complete overhaul
> of state management.
>
> Thanks a ton for capturing the sequence. I did capture
> some debug output at that time; my apologies for not including
> it. But this makes excellent notes and perhaps would be good to
> capture in the commit or in the documentation. Going through this isn't fun :-)

Maybe you could open a kernel.org bugzilla and attach the dmesg log
and "lspci -vv" output. Then we could capture some of your logs and
this discussion there and include a pointer in the changelog.

> Responses below:
> > >
> > > This patch fixes that by setting the p_slot->state only when the work to
> > > handle the power event is executing, protected by the p_slot->hotplug_lock.
> >
> > So let me first try to understand what's going on with the current
> > code. In the normal case where a device is removed or turned off and
> > pciehp can complete everything before another device appears, I think
> > the flow is like this:
>
> You got this problem part right. Spot on!
> >
> > p_slot->state == STATIC_STATE (powered on, link up)
> >
> > <-- surprise link down interrupt
> > pciehp_isr()
> >   queue INT_LINK_DOWN work
> >
> > interrupt_event_handler(INT_LINK_DOWN)
> >   set p_slot->state = POWEROFF_STATE
> >   queue DISABLE_REQ work
> >
> > pciehp_power_thread(DISABLE_REQ)
> >   send PCI_EXP_SLTCTL_PWR_OFF command
> >   wait for power-off to complete
> >   set p_slot->state = STATIC_STATE
> >
> > p_slot->state == STATIC_STATE (powered off)
> >
> > In the problem case, the link goes down, and while pciehp is still
> > dealing with that, the link comes back up. So I think one possible
> > sequence is like this:
> >
> > p_slot->state == STATIC_STATE (powered on, link up)
> >
> > <-- surprise link down interrupt
> > 1a pciehp_isr()
> >      queue INT_LINK_DOWN work # queued: 1-LD
> >
> > 1b interrupt_event_handler(INT_LINK_DOWN) # process 1-LD
> >      # handle_link_event() sees case STATIC_STATE
> >      set p_slot->state = POWEROFF_STATE
> >      queue DISABLE_REQ work # queued: 1-DR
> >
> > <-- surprise link up interrupt
> > 2a pciehp_isr()
> >      queue INT_LINK_UP work # queued: 1-DR 2-LU
> >
> > 1c pciehp_power_thread(DISABLE_REQ) # process 1-DR
> >      send PCI_EXP_SLTCTL_PWR_OFF command
> >      wait for power-off to complete
> >      set p_slot->state = STATIC_STATE
> >
> > <-- link down interrupt (result of PWR_OFF)
> > 3a pciehp_isr()
> >      queue INT_LINK_DOWN work # queued: 2-LU 3-LD
> >
> > 2b interrupt_event_handler(INT_LINK_UP) # process 2-LU
> >      # handle_link_event() sees case STATIC_STATE
> >      set p_slot->state = POWERON_STATE
> >      queue ENABLE_REQ work # queued: 3-LD 2-ER
> >
> > 3b interrupt_event_handler(INT_LINK_DOWN) # process 3-LD
> >      # handle_link_event() sees case POWERON_STATE, so we emit
> >      # "Link Down event queued; currently getting powered on"
> >      set p_slot->state = POWEROFF_STATE
> >      queue DISABLE_REQ work # queued: 2-ER 3-DR
> >
> > 2c pciehp_power_thread(ENABLE_REQ) # process 2-ER
> >      send PCI_EXP_SLTCTL_PWR_ON command
> >      wait for power-on to complete
> >      set p_slot->state = STATIC_STATE
> >
> > <-- link up interrupt (result of PWR_ON)
> > 4a pciehp_isr()
> >      queue INT_LINK_UP work # queued: 3-DR 4-LU
> >
> > 3c pciehp_power_thread(DISABLE_REQ) # process 3-DR
> >      send PCI_EXP_SLTCTL_PWR_OFF command
> >      wait for power-off to complete
> >      set p_slot->state = STATIC_STATE
> >
> > <-- link down interrupt (result of PWR_OFF)
> > 5a pciehp_isr()
> >      queue INT_LINK_DOWN work # queued: 4-LU 5-LD
> >
> > State 5a is the same as 3a (we're in STATIC_STATE with Link Up and
> > Link Down work items queued), so the whole cycle can repeat.
> >
> > Now let's assume we apply this patch and see what changes. The patch
> > changes where we set p_slot->state. Currently we set POWEROFF_STATE
> > or POWERON_STATE in the interrupt_event_handler() work item. The
> > patch moves that to the pciehp_power_thread() work item, where the
> > power commands are actually sent.
>
> Right. The difference with this patch is that when we set the state to
> POWERON_STATE or POWEROFF_STATE, we only do that once the previous
> POWER* operation has entirely completed, since it's now protected by the
> hotplug_lock mutex.
>
> In the problem case, since we set the state before the pciehp_power_thread,
> we end up changing the state to POWER*_STATE before the previous POWER*
> action has completed.
> >
> > p_slot->state == STATIC_STATE (powered on, link up)
> >
> > <-- surprise link down interrupt
> > 1A pciehp_isr()
> >      queue INT_LINK_DOWN work # queued: 1-LD
> >
> > 1B interrupt_event_handler(INT_LINK_DOWN) # process 1-LD
> >      # handle_link_event() sees case STATIC_STATE
> >      # set p_slot->state = POWEROFF_STATE # (removed by patch)
> >      queue DISABLE_REQ work # queued: 1-DR
> >
> > <-- surprise link up interrupt
> > 2A pciehp_isr()
> >      queue INT_LINK_UP work # queued: 1-DR 2-LU
> >
> > 1C pciehp_power_thread(DISABLE_REQ) # process 1-DR
>
> Also mutex hotplug_lock is held.
>
> >      set p_slot->state = POWEROFF_STATE # (added by patch)
> >      send PCI_EXP_SLTCTL_PWR_OFF command
> >      wait for power-off to complete
> >      set p_slot->state = STATIC_STATE
> >
> > <-- link down interrupt (result of PWR_OFF)
> > 3A pciehp_isr()
> >      queue INT_LINK_DOWN work # queued: 2-LU 3-LD
>
> The above INT_LINK_DOWN will eventually be ignored in handle_link_event()
> because we are in POWEROFF_STATE, and a link down while in POWEROFF will
> be ignored.
> >
> > 2B interrupt_event_handler(INT_LINK_UP) # process 2-LU
> >      # handle_link_event() sees case STATIC_STATE
> >      # set p_slot->state = POWERON_STATE # (removed by patch)
> >      queue ENABLE_REQ work # queued: 3-LD 2-ER
> >
> > 3B interrupt_event_handler(INT_LINK_DOWN) # process 3-LD
> >      # handle_link_event() sees case STATIC_STATE,
> >      # unlike 3b above, which saw POWERON_STATE;
> >      # doesn't emit a message, but still queues DISABLE_REQ
> >      # set p_slot->state = POWEROFF_STATE # (removed by patch)
> >      queue DISABLE_REQ work # queued: 2-ER 3-DR
>
> 3B will be ignored, since handle_link_event() knows we are in the process
> of POWEROFF.

What enforces this ordering? handle_link_event() will only see
POWEROFF_STATE if it happens to read the state after
pciehp_power_thread() sets POWEROFF_STATE and before it
sets it back to STATIC_STATE. Given our work item concurrency,
I think that's possible, but I don't see how it's guaranteed.
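
To make the question concrete, here is a minimal Python model of that
window. It is a sketch, not kernel code: Slot, power_thread_steps(),
handle_link_down() and run() are hypothetical names, and the only behavior
taken from the outline is that the patched pciehp_power_thread() sets
POWEROFF_STATE, sends the command, and then restores STATIC_STATE, while
the Link Down handler ignores the event only if it sees POWEROFF_STATE.

```python
STATIC_STATE, POWEROFF_STATE = "STATIC_STATE", "POWEROFF_STATE"


class Slot:
    def __init__(self):
        self.state = STATIC_STATE
        self.queued = []  # power work items queued by the link handler


def power_thread_steps(slot):
    """Model of the patched pciehp_power_thread(DISABLE_REQ).

    Written as a generator so the caller can pause between steps.
    """
    slot.state = POWEROFF_STATE  # (added by patch)
    yield "PWR_OFF sent"
    slot.state = STATIC_STATE
    yield "power-off complete"


def handle_link_down(slot):
    """Model of handle_link_event() for a Link Down event."""
    if slot.state == POWEROFF_STATE:
        return "ignored"
    slot.queued.append("DISABLE_REQ")
    return "queued DISABLE_REQ"


def run(handler_runs_inside_window):
    """Run the handler either inside or after the POWEROFF_STATE window."""
    slot = Slot()
    steps = power_thread_steps(slot)
    next(steps)  # state is now POWEROFF_STATE
    if handler_runs_inside_window:
        result = handle_link_down(slot)
        next(steps)  # state back to STATIC_STATE
    else:
        next(steps)  # state back to STATIC_STATE
        result = handle_link_down(slot)
    return result, slot.queued
```

In this model the same Link Down event is dropped when the handler runs
inside the window and queues another DISABLE_REQ when it runs after it, so
the outcome depends only on timing.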
> > 2C pciehp_power_thread(ENABLE_REQ) # process 2-ER
>
> We are also protected by mutex hotplug_lock here. So the following
> won't get executed until step 1C has run to completion and the
> mutex is released.
>
> >      set p_slot->state = POWERON_STATE # (added by patch)
> >      send PCI_EXP_SLTCTL_PWR_ON command
> >      wait for power-on to complete
> >      set p_slot->state = STATIC_STATE
> >
> > <-- link up interrupt (result of PWR_ON)
> > 4A pciehp_isr()
> >      queue INT_LINK_UP work # queued: 3-DR 4-LU
>
> handle_link_event() would eventually dismiss the INT_LINK_UP since
> it knows we are in the process of POWERON.
> >
> > 3C pciehp_power_thread(DISABLE_REQ) # process 3-DR
> >      set p_slot->state = POWEROFF_STATE # (added by patch)
> >      send PCI_EXP_SLTCTL_PWR_OFF command
> >      wait for power-off to complete
> >      set p_slot->state = STATIC_STATE
> >
> > <-- link down interrupt (result of PWR_OFF)
> > 5A pciehp_isr()
> >      queue INT_LINK_DOWN work # queued: 4-LU 5-LD
> >
> > With this particular ordering, I think we still have the same problem:
> > 5A is the same as 3A, so I think the cycle could repeat.
>
> I think the sequence is almost right, except for the fact that, since we
> are protected by hotplug_lock, we don't allow another POWERON or POWEROFF
> to be processed until the previous POWER* operation has completed entirely.

handle_link_event() is protected by "lock" but not by "hotplug_lock",
so I think it can queue ENABLE/DISABLE items even before the previous
POWER* operation completes.

You're right that I omitted the hotplug_lock details. I added them to
my outline (at https://goo.gl/szqWTC if you're interested), but I
don't see how that prevents the scenario above.
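
As a rough check of that scenario, the ordering 1A..5A from the patched
outline can be replayed in a small Python model. This is a sketch under
stated assumptions, not kernel code: replay() and the item names LD, LU,
DR and ER are hypothetical shorthand for the Link Down, Link Up,
DISABLE_REQ and ENABLE_REQ work items, hotplug_lock is modeled by running
each power item to completion before the next item starts, and each power
command is assumed to raise the follow-up link interrupt shown in the
outline.

```python
from collections import deque

STATIC_STATE = "STATIC_STATE"


def replay(cycles=1):
    """Replay the patched outline in the order 1A, 1B, 2A, 1C, 3A, ...

    Returns one (state, queued items) snapshot per completed power-off,
    taken right after the resulting Link Down is queued (3A, 5A, ...).
    """
    state = STATIC_STATE
    queue = deque(["LD"])  # 1A: surprise link down
    snapshots = []
    surprise_link_up_pending = True

    while len(snapshots) < cycles + 1:
        item = queue.popleft()
        if item == "LD":  # handle_link_event(): Link Down
            if state == STATIC_STATE:
                queue.append("DR")  # queue DISABLE_REQ
            if surprise_link_up_pending:
                queue.append("LU")  # 2A: surprise link up arrives now
                surprise_link_up_pending = False
        elif item == "LU":  # handle_link_event(): Link Up
            if state == STATIC_STATE:
                queue.append("ER")  # queue ENABLE_REQ
        elif item == "DR":  # pciehp_power_thread(DISABLE_REQ)
            state = "POWEROFF_STATE"  # (added by patch)
            state = STATIC_STATE  # power-off complete
            queue.append("LD")  # link down interrupt (result of PWR_OFF)
            snapshots.append((state, tuple(queue)))
        elif item == "ER":  # pciehp_power_thread(ENABLE_REQ)
            state = "POWERON_STATE"  # (added by patch)
            state = STATIC_STATE  # power-on complete
            queue.append("LU")  # link up interrupt (result of PWR_ON)
    return snapshots
```

With this ordering, replay() returns identical snapshots for 3A and 5A
(STATIC_STATE with Link Up and Link Down queued), and replay(3) keeps
returning the same snapshot for every later power-off.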
> Just to summarize, we only queue the POWEROFF due to surprise link down
> and another POWERON due to the link coming back up. The transient link-down
> events are conveniently ignored.

I'm leery about ignoring events, though it happens to be convenient in
this case. I think we're ignoring them because we're running work
items simultaneously with other items, and I think that concurrency is
unnecessary complexity.

I think it would be safer to queue every event and process every event
serially.
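
A minimal sketch of what "serially" could mean, again as a Python model
and not as a proposed implementation: one FIFO queue drained by a single
worker, where each event is handled start to finish (including the power
operation) before the next one is looked at. Slot, process(), worker() and
run_serial() are hypothetical names, and the model deliberately leaves out
the follow-up interrupts that real power commands generate, so it says
nothing about whether the cycle above goes away.

```python
import queue
import threading

STATIC_STATE = "STATIC_STATE"


class Slot:
    def __init__(self):
        self.state = STATIC_STATE
        self.powered = True
        self.log = []  # (event, state seen on entry to the handler)


def process(slot, event):
    """Handle one event completely, including its power operation."""
    slot.log.append((event, slot.state))
    if event == "LINK_DOWN":
        slot.state = "POWEROFF_STATE"
        slot.powered = False  # stands in for PWR_OFF plus the wait
    elif event == "LINK_UP":
        slot.state = "POWERON_STATE"
        slot.powered = True  # stands in for PWR_ON plus the wait
    slot.state = STATIC_STATE


def worker(slot, events):
    """Single consumer: events are processed strictly in arrival order."""
    while True:
        event = events.get()
        if event is None:  # sentinel: no more events
            break
        process(slot, event)


def run_serial(incoming):
    slot = Slot()
    events = queue.Queue()
    thread = threading.Thread(target=worker, args=(slot, events))
    thread.start()
    for event in incoming:
        events.put(event)
    events.put(None)
    thread.join()
    return slot
```

Because only one item runs at a time, every handler in this model enters
with the slot in STATIC_STATE, so there is no in-between state in which an
event would have to be ignored.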