> Sending again due to accidental HTML.
>
> On XXXXX, Siddharth Gupta wrote:
>> On 5/24/2021 8:03 PM, Bjorn Andersson wrote:
>>> On Mon 17 May 18:08 CDT 2021, Siddharth Gupta wrote:
>>>> Subdevices at the beginning of the subdev list should have
>>>> higher priority than those at the end of the list. Reverse
>>>> traversal of the list causes priority inversion, which can
>>>> impact the performance of the device.
>>> The subdev lists layers of the communication onion, we bring them up
>>> inside out and we take them down outside in.
>>>
>>> This stems from the primary idea that we want to be able to shut things
>>> down cleanly (in the case of a stop) and we pass the "crashed" flag to
>>> indicate to each recipient during "stop" that it may not rely on the
>>> response of a lower layer.
>>>
>>> As such, I don't think it's right to say that we have a priority
>>> inversion.
>> My understanding of the topic was that each subdevice should be
>> independent of the other. In our case unfortunately the sysmon
>> subdevice depends on the glink endpoint.
> In that case, the glink has to be prepared/started before sysmon, right?

Yes, that will not change with the introduction of this change.
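
For readers following along, here is a minimal userspace sketch of the "onion" ordering described in the thread: layers are started in list order and stopped in reverse, with the "crashed" flag handed to every stop call. This is not the remoteproc core code. The struct, the function names and the event log are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a remoteproc subdevice; not the kernel struct. */
struct subdev {
	const char *name;
};

/* Event log so the resulting order can be inspected afterwards. */
static char event_log[256];

static void log_event(const char *verb, const char *name, bool crashed)
{
	char entry[64];

	snprintf(entry, sizeof(entry), "%s:%s%s ", verb, name,
		 crashed ? "(crashed)" : "");
	assert(strlen(event_log) + strlen(entry) < sizeof(event_log));
	strcat(event_log, entry);
}

/* Bring the layers up inside out: first list entry first. */
static void start_subdevices(const struct subdev *subdevs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		log_event("start", subdevs[i].name, false);
}

/*
 * Take the layers down outside in: last list entry first. The "crashed"
 * flag tells each layer that it may not rely on a response from below.
 */
static void stop_subdevices(const struct subdev *subdevs, size_t n,
			    bool crashed)
{
	for (size_t i = n; i-- > 0; )
		log_event("stop", subdevs[i].name, crashed);
}
```

With a list of glink, sysmon and ssr, the stop pass visits ssr first and glink last, which is the ordering the patch description objects to.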

>> However the priority inversion doesn't happen in these
>> subdevices, it happens due to the SSR notifications that we send
>> to kernel clients. In this case kernel clients also can have QMI
>> sockets that in turn depend on the glink endpoint, which means
>> when they go to release the QMI socket a broadcast will be sent
>> out to all connected clients about the closure of the connection
>> which in this case happens to be the remoteproc which died. So
>> if we peel the onion, we will unnecessarily be waiting for a
>> dead remoteproc.
> So why can't the QMI layer be smart about this and check that the
> remoteproc hasn't crashed before you try to communicate with it? Or if
> the glink is torn down before QMI gets to broadcast, then it's a
> pretty clear indication of failure and just notify all the kernel side
> QMI clients?

I made a mistake earlier, QMI is the layer that creates a QRTR

>>>> For example a device adds the glink, sysmon and ssr subdevs
>>>> to its list. During a crash the ssr notification would go
>>>> before the glink and sysmon notifications. This can cause a
>>>> degraded response when a client driver waits for a response
>>>> from the crashed rproc.
>>> In general the design is such that components are not expected to
>>> communicate with the crashed remote when "crashed" is set, this avoids
>>> the single-remote crash.
>> Here the glink device on the rpmsg bus won't know about the
>> crashed remoteproc till we send glink notification first, right?
> Why not just query the current state of the remote proc before trying
> to talk to it? It should be a quick check.

The subdevice concept serves the purpose of informing devices
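
As a rough illustration of the "query the state before talking" suggestion, the sketch below checks a state field and skips the message when the remote is marked as crashed. It is a userspace toy and not an existing remoteproc, QMI or rpmsg interface. Every type and function name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states; the names are illustrative, not the kernel's enum. */
enum remote_state { REMOTE_RUNNING, REMOTE_CRASHED };

struct remote {
	enum remote_state state;
	int messages_sent;	/* messages actually handed to the remote */
};

enum notify_result { NOTIFY_SENT, NOTIFY_SKIPPED_REMOTE_DEAD };

/*
 * Check the remote's state before talking to it. If it has crashed,
 * skip the message instead of waiting for an ack that cannot arrive.
 */
static enum notify_result notify_remote(struct remote *r)
{
	assert(r != NULL);

	if (r->state == REMOTE_CRASHED)
		return NOTIFY_SKIPPED_REMOTE_DEAD;

	r->messages_sent++;
	return NOTIFY_SENT;
}
```

Whether such a check is sufficient depends on when the state becomes visible to the caller, which is the point being debated in the thread.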

>> Since we send out sysmon and SSR notifications first, the glink
>> device will still be "alive" on the rpmsg bus.
>>> The case where this isn't holding up is when two remote processors
>>> crash simultaneously, in which case e.g. sysmon has been seen hitting
>>> its timeout waiting for an ack from a dead remoteproc - but I was under
>>> the impression that this window shrunk dramatically as a side effect of
>>> us fixing the notification ordering.
>> You are right, the window would become smaller in the case of two
>> remoteprocs, but this issue can come up with even a single
>> remoteproc unless we prioritize certain subdevices.
> I think the main problem you have here is rproc sub devices that
> depend on other rproc sub devices. But there's no dependency tracking
> here. Your change just happens to work for your specific case because
> the order of the sub devices in the list happens to work for your
> inter-subdevice dependencies. But this is definitely not going to work
> for all users of subdevices.

I will agree with you if you call this change ugly (because it

> If keeping track of dependencies is too much complexity (I haven't read
> enough rproc code to comment on that), at the least, it looks like you
> need another op instead of changing the order of stop() callbacks. Or
> at a minimum pick the ordering based on the "crashed" flag. A blanket
> "I'll just switch the ordering of stop() for everyone for all cases" is
> wrong.
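
One possible reading of "pick the ordering based on the crashed flag" is sketched below: on a crash the subdevices are visited in list order, and on a clean stop in reverse initialization order. This is only an illustration. The helper name and MAX_LAYERS are invented, and whether list order on crash suits every user of subdevices is the open question here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LAYERS 8	/* arbitrary bound for this sketch */

/*
 * Hypothetical helper: fill `order` with indices into the subdev list.
 * crashed == true  -> list order (index 0 first)
 * crashed == false -> reverse initialization order (last index first)
 */
static void pick_stop_order(size_t n, bool crashed, size_t order[])
{
	assert(n <= MAX_LAYERS);

	for (size_t i = 0; i < n; i++)
		order[i] = crashed ? i : n - 1 - i;
}
```

For a list of glink, sysmon and ssr, a crash would visit glink first, while a clean stop would still take down ssr and sysmon before glink.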

> In fact, in the normal/clean shutdown case, I'd think you'll want to
> stop the subdevices in reverse initialization order so that you can
> cleanly stop QMI/sysmon first before shutting down glink.

In the case of a normal/clean shutdown the users of the

> -Saravana