using device links and OF on dynamic/optional hardware

From: Alvin Šipraga
Date: Fri Jul 19 2024 - 08:23:22 EST


Hello Saravana, Greg,

I would like to get your advice on some driver design relating to a bus
driver I am working on. Specifically I am having some trouble making use
of the device link infrastructure together with the specific use case of
our hardware. I will try and be brief but there are unfortunately a lot
of details so I hope that you have some patience to look at my problem
or CC others who might be able to help. Please let me know if anything
is unclear.

First of all you can find the first instance of my patch series at the
following link [1]. The current form has some shortcomings and I am
working on fixing those, and that's why I would like your help.

[1] https://lore.kernel.org/all/20240517-a2b-v1-0-b8647554c67b@xxxxxxxxxxxxxxx/#t

For further background and introduction to the hardware, I refer you to
some conference slides here [2] which describes one of the
use-cases. More on that later.

[2] https://pqrs.dk/~alvin/docs/a2b.pdf

To give a short summary, the driver is for a bus called A2B which
consists of a daisy-chained series of ICs (so-called A2B nodes)
connected to one another over an unshielded twisted pair. The node
nearest to the host is called the main node and those further downstream
are called subordinate nodes. The control interface to the host is over
I2C. Here is a picture:

┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│ │ I2C │ │ UTP │ │ │ │ │ │
│ host ├─────┤ main │XXXXX│ sub1 │XXXXX│ sub2 │XX ... XX│ subN │
│ │ │ │ │ │ │ │ │ │
└──────┘ └──────┘ └──────┘ └──────┘ └──────┘
\______________________________ ... ______/

A2B bus

On a hardware level the nodes are enumerated through a process known as
discovery. To discover each subordinate node, the main node is
programmed to send a "discovery" packet downstream on the bus to find
the next undiscovered subordinate node. In the diagram above this would
happen N times, and the N+1'th discovery packet would not get a
response, so the driver infrastructure can then conclude that there are
N nodes connected.

Each node in the bus, including the main node, binds to an A2B node
Linux driver. The A2B node driver implements bus-level operations which
are necessary to enact the discovery of any next-in-line subordinate
nodes. That means that to discover a subordinate node N, the node N-1
must have successfully bound to its node drvier. This is much like a
supplier-consumer relationship in the context of managed device links:
for node N to bind to its driver, node N-1 must be bound, and if node N
is unbound, node N-1 must be unbound first.

The original code submission in [1] implements this discovery process in
the A2B driver core for each instance of an A2B bus: a struct device for
the main node is registered automatically because it is trivially
present and requires no discovery. After every node device - including
the main node - binds to its A2B node driver, the core will then
transmit a discovery packet and register another struct device for a
subordinate node if the operation is successful. In this way I managed
to ensure that node devices are probed in the correct order.

The hardware also supports detection of bus drops - disconnection events
- between the nodes. If such an event is received on node N, it means
that node N+1 has been disconnected from the bus. The implementation [1]
will then unregister all devices corresponding to nodes N+1 and greater.

The implementation [1] is rather bespoke and presented several
challenges which I am now trying to address. In the same link [1] under
the HELP WANTED header I have described two of these issues which will
give you a bit of context:

1. The first problem relates to the device links that are
automatically created as a result of the device tree topology,
which prevent my sound card from probing without use of the
post-init-providers property. This is an issue in one of my
company's use cases, a picture of which you can see on page 20 of
[2]. Namely, I have A2B nodes described in device tree which may
not be present depending on how large a speaker setup the customer
is running. I haven't found an alternative solution to this besides
to use post-init-providers.

2. The second and more pressing issue is that in some other hardware I
am working on, some nodes require additional resources to be
configured before they can be discovered: think regulator supplies
and muxes, for example. In an ordinary driver model this would be
acquired in a device driver's probe function, but since the
implementation [1] is performing discovery before even creating the
struct device in question, I cannot proceed in that way.

As a result of (2) above, I am now trying to take an approach I have
seen more commonly in other subsystems, namely, to parse the device tree
when an A2B bus is registered and gratuitously create struct devices for
each device tree node describing an A2B node. I2C does this for
example. The discovery process is then placed in an API that an A2B node
driver's probe function can call when it has configured all other
necessary resources first (e.g. muxes). In the case that a node is not
present, I want its driver's probe function to simply return
-ENODEV. Then, even in setups where not all nodes are connected, I don't
get the missing nodes' devices stuck in the deferred probe queue
forever.

For this I have tried to use device links, but with limited
success. What I would like to describe is managed device links between
nodes where node N is a supplier of node N+1. When registering all node
devices at once, the driver core will then ensure that they are probed
in the correct order, i.e. that node N has bound before node N+1 gets
probed.

I have encountered two problems with the new approach:

1. In the event that node N is not present and its driver's probe
function returns -ENODEV, nodes N+1 then never get probed, and
remain in the deferred probe queue forever. This seems incorrect
because it is a completely valid hardware setup from our point of
view.

Reading the documentation, I thought that I should use
DL_FLAG_AUTOREMOVE_SUPPLIER to indicate that the device link should
be deleted when the supplier's driver (a) unbinds or (b) fails to
probe. In that way the probe of downstream nodes > N will still be
attempted when node N fails to probe (due to its absence, -ENODEV).
But it appears that only (a) is actually implemented? i.e. if node
N fails to probe with -ENODEV, I still get the same problem above
that nodes N+1 get stuck in the deferred probe queue.

I'm able to work around this problem by creating
DL_FLAG_AUTOREMOVE_CONSUMER device links later on between consumer
N+1 and supplier N when node N+1's probe function succeeds. In this
case the initial probe of nodes is unordered, but I rely on other
heuristics within the driver design to detect whether or not an
upstream node has been bound (in which case I return -EPROBE_DEFER)
or is missing (in which case I return -ENODEV). The device link
then helps when node drivers are unbound by ensuring that all
downstream node devices are also unbound from their drivers in the
correct order (furtherst to nearest). This leads to quite a few
spurious -EPROBE_DEFERs and some bespoke logic inside my A2B driver
core code, but if you think it's OK, then I can live with it.

One side note is that I appear to be hitting the same WARNING that
Herve tried to address here [3]:

[3] https://lore.kernel.org/all/20231110080241.702999-1-herve.codina@xxxxxxxxxxx/

For reasons of practicality I am prototyping on an older kernel and
haven't empirically checked if this is fixed on the latest kernel,
but so far as I can see, the patch was never applied. Is this a
known issue?

2. In the event of a bus drop detection at node N, I want Linux to
respond by unbinding all downstream nodes > N. Right now I do this
in the A2B driver core but I wonder if this is correct.

Worse still, I have to support a specific use-case where nodes drop
off the bus during setup (due to firmware flashing of another
peripheral and a hardware reset) and then I want to rediscover
them. The specific scenario is described in the comments for this
custom A2B node driver [4]:

[4] https://lore.kernel.org/all/20240517-a2b-v1-12-b8647554c67b@xxxxxxxxxxxxxxx/

The way it's handled is that for these specific pieces of hardware,
a custom A2B node driver will first perform the firmware flashing
within the probe function, then initiate a hardware reset which the
driver knows will induce a bus drop. The driver handles this by
simply returning -EPROBE_DEFER until the hardware is in the ready
state.

In the old design this was handled somewhat naturally because I
would delete the corresponding struct device when it drops off the
bus, and then schedle deferred work to reinitiate the discovery
process. When it comes back, a new struct device is registered and
the driver core would probe it for me automatically.

But with the new design I am trying, I have to instead handle this
by explicitly unbinding the drivers and leaving the struct devices
around, and then schedule a manual re-bind attempt. I am not
completely sure how safe or correct this is to do from a driver
model perspective.

I am using device tree to describe the hardware because I am working on
embedded ARM and using subsystems such as ASoC which all use it as
well. But the hardware is dynamic and somewhat[*] hot-pluggable by
nature so I am interested in making the driver design amenable to such
situations as well. To that end I am still not completely sold on this
idea of gratuitously creating struct devices just because there's a
device tree node. If future extension of the driver implemented OF-less
dynamic enumeration, one would expect that it would create struct
devices only in the event of a successful discovery. But I am hitting
the limits of my experience with the driver model here and would need to
bounce some ideas.

[*] There is no built-in plug detection, only bus drop (unplug). But
plug detection could in principle be handled out-of-band or via
polling.

I would be grateful if you have any advice or thoughts on the
matter. Basically I want to implement things in a way which is
commensurate with the driver model and not too unorthodox. But the
dynamism of the hardware is causing me a bit of a headache!

Thank you in advance!

Kind regards,
Alvin