Re: [PATCH] iwlwifi: don't panic in error path on non-msix systems

From: Kirtika Ruchandani
Date: Mon Apr 22 2019 - 22:35:16 EST


On Mon, Apr 22, 2019 at 4:56 PM Kirtika Ruchandani <kirtika@xxxxxxxxxx> wrote:
>
> On Mon, Apr 22, 2019 at 11:07 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >
> > On Wed 17-04-19 10:35:16, Luca Coelho wrote:
> > > From: Shahar S Matityahu <shahar.s.matityahu@xxxxxxxxx>
> > >
> > > The driver uses msix causes-register to handle both msix and non msix
> > > interrupts when performing sync nmi. On devices that do not support
> > > msix this register is unmapped and accessing it causes a kernel panic.
> > >
> > > Solve this by differentiating the two cases and accessing the proper
> > > causes-register in each case.
>
> Are you sure reading CSR_INT from trans.c without explicitly getting irq_lock.c
> like rx.c does, is thread-safe? I don't claim to understand this fully, but this
> smells wrong from past experience with this driver. I'll see if I can cook up
> a test case with a race condition here.

Sorry for the typos. I meant "writing (not reading) to CSR_INT in
iwl_trans_pcie_sync_nmi
without explicitly getting trans_pcie->irq_lock like iwl_pcie_irq_handler does".
I spent some time playing around this, and while I don't have a black-and-white
test-case to show this patch has side-effects (at the very least), I
have some notes:

1. Repeatedly sending fw_nmi is a good test-case for the problem this patch
is trying to solve. i.e. I had
$ cd /sys/kernel/debug/iwlwifi/${PCI_ID}/iwlmvm/
$ while true; do echo 1 > fw_nmi ; done # this is likely too harsh
and needs a sleep in b/w

With the current ToT driver in wireless-drivers-next, this ramoops-es
super quickly after hitting
the problematic MSI-X read.

2. With this patch applied, I hit an ADVANCED_SYSASSERT 0x0 and it took ~800ms
and dozens of kernel warnings before the driver "recovered".
I've sent the full dmesg to Luca off-list.
It didn't get any better with adding spin_lock(&trans_pcie->irq_lock)
/ spin_unlock(&trans_pcie->irq_lock)
around the problematic CSR_INT write in iwl_trans_sync_nmi.

So I still don't have any concrete reasoning or proof, except a dmesg
after a test that doesn't look right.
Apologies if this discussion is moot - FWIW, the thing that caused a
code smell in the first place
was that historically, nothing in trans.c had mucked with writing to
CSR_INT, except when enabling interrupts again
with a 0xFFFFFFFF at _initialization time_ (and we've seen race
conditions there in the past).
You can check this with `git grep -p -w CSR_INT trans.c`.

> > >
> > > Reported-by: Michal Hocko <mhocko@xxxxxxxxxx>
> > > Signed-off-by: Shahar S Matityahu <shahar.s.matityahu@xxxxxxxxx>
> > > Signed-off-by: Luca Coelho <luciano.coelho@xxxxxxxxx>
> >
> > $ dmesg | grep "Error sending SCAN_CFG_CMD:"
> > [49786.288548] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.
> > [53457.166877] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.
> >
> > without the oops and with the iwlwifi internal dump IIUC which is the
> > previous behavior.
> > [53457.166877] iwlwifi 0000:01:00.0: Error sending SCAN_CFG_CMD: time out after 2000ms.
> > [53457.166882] iwlwifi 0000:01:00.0: Current CMD queue read_ptr 224 write_ptr 225
> > [53457.414973] iwlwifi 0000:01:00.0: HW error, resetting before reading
> > [53457.421339] iwlwifi 0000:01:00.0: Start IWL Error Log Dump:
> > [53457.421345] iwlwifi 0000:01:00.0: Status: 0x00000100, count: 1269232956
> > [53457.421347] iwlwifi 0000:01:00.0: Loaded firmware version: 36.9f0a2d68.0
> > [53457.421350] iwlwifi 0000:01:00.0: 0x45E91306 | ADVANCED_SYSASSERT
> > [53457.421352] iwlwifi 0000:01:00.0: 0x2F58D384 | trm_hw_status0
> > [53457.421353] iwlwifi 0000:01:00.0: 0x7F1A8CFD | trm_hw_status1
> > [53457.421355] iwlwifi 0000:01:00.0: 0x07E787FD | branchlink2
> > [53457.421357] iwlwifi 0000:01:00.0: 0xE9E54368 | interruptlink1
> > [53457.421359] iwlwifi 0000:01:00.0: 0x470D9BBF | interruptlink2
> > [53457.421361] iwlwifi 0000:01:00.0: 0xAF040E7E | data1
> > [53457.421362] iwlwifi 0000:01:00.0: 0xE7FBCA48 | data2
> > [53457.421364] iwlwifi 0000:01:00.0: 0x4E4A8288 | data3
> > [53457.421366] iwlwifi 0000:01:00.0: 0x861DEA98 | beacon time
> > [53457.421368] iwlwifi 0000:01:00.0: 0xE8F23466 | tsf low
> > [53457.421369] iwlwifi 0000:01:00.0: 0xD7B19307 | tsf hi
> > [53457.421371] iwlwifi 0000:01:00.0: 0xE58934E3 | time gp1
> > [53457.421373] iwlwifi 0000:01:00.0: 0xB013FEBE | time gp2
> > [53457.421375] iwlwifi 0000:01:00.0: 0x962DCC75 | uCode revision type
> > [53457.421376] iwlwifi 0000:01:00.0: 0xFF8FB30F | uCode version major
> > [53457.421378] iwlwifi 0000:01:00.0: 0x0DD08E17 | uCode version minor
> > [53457.421380] iwlwifi 0000:01:00.0: 0x87FD70DE | hw version
> > [53457.421382] iwlwifi 0000:01:00.0: 0x853F6851 | board version
> > [53457.421384] iwlwifi 0000:01:00.0: 0x08D7F330 | hcmd
> > [53457.421385] iwlwifi 0000:01:00.0: 0x6B7E5FEE | isr0
> > [53457.421387] iwlwifi 0000:01:00.0: 0x2B1E7CD4 | isr1
> > [53457.421389] iwlwifi 0000:01:00.0: 0x3F133B16 | isr2
> > [53457.421391] iwlwifi 0000:01:00.0: 0x5D480C5A | isr3
> > [53457.421392] iwlwifi 0000:01:00.0: 0x34E93EBA | isr4
> > [53457.421394] iwlwifi 0000:01:00.0: 0x42AD8E83 | last cmd Id
> > [53457.421396] iwlwifi 0000:01:00.0: 0x1F5BBCFF | wait_event
> > [53457.421398] iwlwifi 0000:01:00.0: 0x6808B2C1 | l2p_control
> > [53457.421399] iwlwifi 0000:01:00.0: 0x0D5B1F33 | l2p_duration
> > [53457.421401] iwlwifi 0000:01:00.0: 0xF4C94535 | l2p_mhvalid
> > [53457.421403] iwlwifi 0000:01:00.0: 0x3DCE6EBB | l2p_addr_match
> > [53457.421405] iwlwifi 0000:01:00.0: 0xFDDC41FE | lmpm_pmg_sel
> > [53457.421406] iwlwifi 0000:01:00.0: 0xB53A17F5 | timestamp
> > [53457.421408] iwlwifi 0000:01:00.0: 0x5A6A4113 | flow_handler
> > [53457.421474] iwlwifi 0000:01:00.0: Start IWL Error Log Dump:
> > [53457.421477] iwlwifi 0000:01:00.0: Status: 0x00000100, count: 1182976748
> > [53457.421478] iwlwifi 0000:01:00.0: 0x62D2BDB3 | ADVANCED_SYSASSERT
> > [53457.421480] iwlwifi 0000:01:00.0: 0x4D9E5019 | umac branchlink1
> > [53457.421482] iwlwifi 0000:01:00.0: 0x8CB69F6E | umac branchlink2
> > [53457.421484] iwlwifi 0000:01:00.0: 0x9868662D | umac interruptlink1
> > [53457.421486] iwlwifi 0000:01:00.0: 0x9800F8F7 | umac interruptlink2
> > [53457.421488] iwlwifi 0000:01:00.0: 0xC71449B8 | umac data1
> > [53457.421489] iwlwifi 0000:01:00.0: 0xAB0AB17F | umac data2
> > [53457.421491] iwlwifi 0000:01:00.0: 0x6C6F9753 | umac data3
> > [53457.421493] iwlwifi 0000:01:00.0: 0xFC49D724 | umac major
> > [53457.421495] iwlwifi 0000:01:00.0: 0xA61CC627 | umac minor
> > [53457.421496] iwlwifi 0000:01:00.0: 0x45BAA0B8 | frame pointer
> > [53457.421498] iwlwifi 0000:01:00.0: 0x319D112B | stack pointer
> > [53457.421500] iwlwifi 0000:01:00.0: 0xEFD9E2E9 | last host cmd
> > [53457.421502] iwlwifi 0000:01:00.0: 0x82640FF7 | isr status reg
> > [53457.421506] ieee80211 phy0: Hardware restart was requested
> > [53457.941685] iwlwifi 0000:01:00.0: Queue 0 is inactive on fifo 0 and stuck for 2500 ms. SW [224, 225] HW [0, 0] FH TRB=0x02b759ca1
> >
> > Feel free to add
> > Tested-by: Michal Hocko <mhocko@xxxxxxxx>
> >
> > Thanks for your quick patch and sorry it took so long from my side.
> >
> > --
> > Michal Hocko
> > SUSE Labs