Second-source devices and interrupt-mapping race
From: Johan Hovold
Date: Thu Jul 28 2022 - 05:30:18 EST
Hi Marc, Rob and Krzysztof,
When adding support for the new Lenovo Thinkpad X13s laptop, we realised
that it comes with two different touchpad controllers.
To enable some early adopters to use the alternate touchpad, I tried to
enable both nodes in the devicetree and have the i2c-hid driver bind to
the one that is actually present.
This turned out to be racy due to the hid driver in question enabling
async probing so that the populated and non-populated nodes can be
probed in parallel, which in turn lead to some interesting findings.
Specifically, it seems like the interrupt-domain mapping code is racy in
that it can return two different mappings for the same hwirq, and when
the hid driver enables one of them, this may end up looking like
spurious interrupts for the other mapping:
[ +0.014042] i2c_hid_of 0-002c: i2c_device_probe
[ +0.000001] i2c_hid_of 0-0015: i2c_device_probe
[ +0.000025] i2c_hid_of 0-002c: i2c_device_probe - irq mapped (166)
[ +0.000013] i2c_hid_of 0-0015: i2c_device_probe - irq mapped (167)
[ +0.000051] i2c_hid_of 0-002c: supply vddl not found, using dummy regulator
[ +0.000056] i2c_hid_of 0-0015: supply vddl not found, using dummy regulator
[ +0.000016] i2c_hid_of 0-002c: HID probe called for i2c 0x2c
[ +0.000374] i2c_hid_of 0-0015: HID probe called for i2c 0x15
...
[ +0.000180] i2c_hid_of 0-002c: Requesting IRQ: 166
[ +0.000045] irq 167, desc: (____ptrval____), depth: 1, count: 0, unhandled: 0
[ +0.000014] ->handle_irq(): (____ptrval____), handle_bad_irq+0x0/0x220
[ +0.000015] ->irq_data.chip(): (____ptrval____), msm_gpio_irq_chip+0x0/0x108
[ +0.000011] ->action(): 0000000000000000
[ +0.000006] IRQ_NOPROBE set
The interrupt is eventually disabled and the populated device fails to
probe. Note that this only happens intermittently.
This second-source example could obviously be dismissed as something
which is simply not supported (the boot firmware should have made sure
only the populated node was enabled), but what if there were actually
two separate devices sharing an interrupt and that now end up with two
different virq?
Async probing has been around for a while now and needs to be supported,
even if the platform bus doesn't use it (yet).
TL;DR:
1. Marc, does the irq mapping code need to be serialised to handle the
valid case of two devices sharing an interrupt being probed in parallel?
It may not be a common setup, but correctness first?
I've just posted a patch that should address this here:
https://lore.kernel.org/r/20220728092710.21190-1-johan+linaro@xxxxxxxxxx
2. Rob, Krzysztof, I assume that handling second-source devices by
enabling multiple variants in the devicetree can not be considered
correct?
What about the related case of simply non-populated devices (e.g. laptop
variants without a touchscreen)?
Note that we have at least two cases of "second-source" nodes in mainline
("rtc" and "trackpad", respectively):
85a9efcd4e29 ("ARM: mvebu: add DT support for Seagate NAS 2 and 4-Bay")
689b937bedde ("arm64: dts: mediatek: add mt8173 elm and hana board")
and that, for example, the i2c-hid driver explicitly supports
non-populated devices:
b3a81b6c4fc6 ("HID: i2c-hid: check if device is there before really probing")
and the commit message indicates that this is something that Chromebooks
rely on.
For the X13s, I'm not sure how we would go about to tell the variants
apart (the ACPI tables that Windows use include both touchpads and an
optional touchscreen). In the end, the boot firmware might need to
resort to a similar kind of probing if we don't allow the kernel to do
it.
Finally, note that while disabling async probing for "second-source"
nodes (e.g. if we could mark them as requiring that) would take care of
the irq-mapping race, we'd still currently also need to move any
pinconfig handles to the parent bus node (as is also done in one of the
in-tree examples above) to suppress the corresponding pinctrl errors in
case the populated device is probed and bound first:
[ +0.010217] sc8280xp-tlmm f100000.pinctrl: pin GPIO_182 already requested by 0-0015; cannot claim for 0-002c
Johan