Re: [PATCH 1/9] platform/surface: Add Surface Aggregator subsystem
From: Barnabás Pőcze
Date: Thu Nov 19 2020 - 10:54:55 EST
Hi
2020. november 19., csütörtök 0:06 keltezéssel, Maximilian Luz <luzmaximilian@xxxxxxxxx> írta:
> [...]
> >> +/**
> >> + * ssam_nf_head_destroy() - Deinitialize the given notifier head.
> >> + * @nh: The notifier head to deinitialize.
> >> + */
> >> +static void ssam_nf_head_destroy(struct ssam_nf_head *nh)
> >> +{
> >> + cleanup_srcu_struct(&nh->srcu);
> >> +}
> >
> > I'm also wondering if there's any reason why these static one-liner functions are
> > not explicitly marked inline.
>
> Isn't inline more of a linkage keyword at this point? I fully expect any
> compiler to inline things like this automatically at the first hint of
> an optimization flag.
>
I also expect the compiler to inline such functions even without the `inline`
specifier. In retrospect I'm not sure why I actually made this comment, possibly
because of my preference, but I believe it's fine either way, so my comment is
moot, sorry.
> [...]
> >> + * error values via ssam_notifier_from_errno() to notifier values.
> >> + *
> >> + * Also note that any callback that could handle an event should return a value
> >> + * with bit %SSAM_NOTIF_HANDLED set, indicating that the event does not go
> >> + * unhandled/ignored. In case no registered callback could handle an event,
> >> + * this function will emit a warning.
> >> + *
> >> + * In case a callback failed, this function will emit an error message.
> >> + */
> >> +static void ssam_nf_call(struct ssam_nf *nf, struct device *dev, u16 rqid,
> >> + struct ssam_event *event)
> >> +{
> >> + struct ssam_nf_head *nf_head;
> >> + int status, nf_ret;
> >> +
> >> + if (!ssh_rqid_is_event(rqid)) {
> >> + dev_warn(dev, "event: unsupported rqid: 0x%04x\n", rqid);
> >
> > A small note, "%#04x" would insert the "0x" prefix.
>
> Oh neat, didn't know that. Should be "%#06x" though, right?
>
Yes, indeed, sorry, it should be "%#06x".
> [...]
> >> +static int ssam_controller_caps_load_from_acpi(
> >> + acpi_handle handle, struct ssam_controller_caps *caps)
> >> +{
> >> + u32 d3_closes_handle = false;
> >
> > Assinging a boolean like this to a `u32` looks very odd to me.
>
> The value returned by the corresponding DSM call is an integer, but
> essentially only contains a bool value (zero vs. one). I thought using
> false here explicitly for initialization helps clarify that. That also
> makes it consistent with the "caps->d3_closes_handle = false" line
> below.
>
Although I still find it pretty odd, your explanation makes sense to me.
> >> +int ssam_request_sync_alloc(size_t payload_len, gfp_t flags,
> >> + struct ssam_request_sync **rqst,
> >> + struct ssam_span *buffer)
> >> +{
> >> + size_t msglen = SSH_COMMAND_MESSAGE_LENGTH(payload_len);
> >> +
> >> + *rqst = kzalloc(sizeof(**rqst) + msglen, flags);
> >> + if (!*rqst)
> >> + return -ENOMEM;
> >> +
> >> + buffer->ptr = (u8 *)(*rqst + 1);
> >> + buffer->len = msglen;
> >> +
> >> + return 0;
> >> +}
> >
> > I think there is a bit of incosistency: sometimes you use ** pointer + return int,
> > sometimes you return a pointer with potentially embedded errno. I think it would
> > be better if you stuck with one or the other.
>
> My rule of thumb is: If it's one output parameter, use a pointer as
> return value. If it's two or more output parameters (as in this case,
> "buffer" is written to), use two pointer arguments and an int as return.
>
> I noticed that ssam_client_bind doesn't follow that scheme so I'll
> change that.
>
I see, thanks for the explanation, what prompted me to make this comment
was that, as you noted, there were functions with one output parameter that
were not consistent.
> [...]
> >> + * Note that the packet completion callback is, in case of success and for a
> >> + * sequenced packet, guaranteed to run on the receiver thread, thus providing
> >> + * a way to reliably identify responses to the packet. The packet completion
> >> + * callback is only run once and it does not indicate that the packet has
> >> + * fully left the system (for this, one should rely on the release method,
> >> + * triggered when the reference count of the packet reaches zero). In case of
> >> + * re-submission (and with somewhat unlikely timing), it may be possible that
> >> + * the packet is being re-transmitted while the completion callback runs.
> >> + * Completion will occur both on success and internal error, as well as when
> >> + * the packet is canceled.
> >
> > If I understand it correctly, it is possible that submission of a packet fails
> > for the first time, but it's scheduled for resubmission, and this retransmission
> > happens at the same time when the complete() callback is called. If that's the
> > case, then the callback is called with an error condition, no? Thus it is possible
> > that a packet is successfully submitted (for the second, third, etc. time), but the
> > complete() callback receives notification about failure? Or am I missing something?
>
> Not sure I can follow you here. What do you mean by "fails for the first time"?
>
> A couple of notes before I try to explain the error cases:
>
> - Resubmission only happens on timeout or NAK.
>
> - Only sequenced packets can be resubmitted (unsequenced are not added
> to the pending set, thus do not have a packet timeout and do not
> react to NAKs).
>
> - Completion means the packet gets locked, which prevents it from being
> resubmitted and retransmitted, and is removed from all collections.
>
> - Completion cannot be triggered by NAK. In case a NAK is received and
> all tries have been exceeded, the packet still waits for the timeout
> (or ACK, whatever happens first) one last time.
>
> The error cases:
>
> - serdev: Failure to send via serdev results in immediate completion
> with error on the transmitter thread. Due to a NAK (or a _very_
> unlikely timeout), the packet can be queued for a second time at that
> point, but, as completion runs on the transmitter thread in this
> case, the packet is removed from the queue before the thread goes on
> to get the next packet.
>
> - Timeout: On resubmission, the timeout of a packet is disabled. It is
> (re-)armed directly before starting transmission on the transmitter
> thread.
>
> What _could_ now happen is that, for some reason, the transmitter
> thread waits the full timeout period between arming the timeout and
> actually sending it (which I'd consider highly unlikely). Note that
> "actually sending it" includes whatever asynchronous stuff the serdev
> core does after serdev_device_write_buf() returns.
>
> In that case, the packet can be transmitted _after_ it has received a
> timeout. So assuming this is the last try, then the packet can be
> sent after completion with -ETIMEDOUT.
>
> I believe this is what you meant?
>
Yes, that's what I meant, thanks for the clarification and broader explanation
about the things at play here.
> Again, I consider this highly unlikely as this would require either
> the scheduler to pause our thread for a full timeout period or
> require the serdev core to basically do nothing for a full timeout
> period. I'd argue this is a lot less likely than a dropped ACK packet
> from the EC (which results in the same thing: EC receives data but we
> think it didn't, causing a timeout error).
>
> Note that putting the timeout (re-)arming step after the last call to
> serdev_device_write_buf() doesn't really help either, as that only
> writes to the transport buffer. If we would want to guarantee that
> this doesn't happen, we'd have to explicitly wait for the hardware to
> finish transmitting, which essentially breaks the whole point of
> buffering.
>
> Also I prefer to have the timeout cover the transmission part as
> well. In the very least, this simplifies reasoning about it once I've
> restricted it to be (re-)set under the pending lock only (as
> mentioned in the message to Andy).
>
> - Finally, there is completion by cancellation or shutdown, for which
> the packet can have been (or currently be) transmitted successfully
> but completion returns with -ECANCELLED or -ESHUTDOWN, respectively.
> This is the expected behavior in those cases. Again, the completion
> procedure takes care of locking the packet and removing all
> references from the collections.
>
> Also there's one success case, when we receive an ACK after a timeout
> and during the re-transmission. Then the packet is completed with
> success, but might still be transmitted once more. The EC should detect
> that packet as re-submission via its packet ID and ignore it.
>
> >> [...]
> >
> >> +int ssh_ptl_rx_rcvbuf(struct ssh_ptl *ptl, const u8 *buf, size_t n)
> >> +{
> >> + int used;
> >> +
> >> + if (test_bit(SSH_PTL_SF_SHUTDOWN_BIT, &ptl->state))
> >> + return -ESHUTDOWN;
> >> +
> >> + used = kfifo_in(&ptl->rx.fifo, buf, n);
> >
> > Isn't it possible that `n` is greater than the free space in the fifo? What
> > happens then?
>
> If I've read the documentation of kfifo_in() right, it takes at most n
> elements. If n is greater than the free space, it will only take as much
> elements as it has space. The return value of kfifo_in() is the number
> of elements actually consumed. We simply return that here (as elements =
> bytes) and ultimately let the serdev/tty core take care of handling
> that.
>
Ahh, sorry, you're right.
> >> + if (used)
> >> + ssh_ptl_rx_wakeup(ptl);
> >> +
> >> + return used;
> >> +}
> [...]
> >> + ((struct ssh_frame *)sf.ptr)->len);
> >
> > Why isn't `get_unaligned_le16()` used here? (Or simply even `sp.len`.)
>
> This should actually be "sp.len + SSH_MESSAGE_LENGTH(0)", and also
> format spezifier %zu with that as SSH_MESSAGE_LENGTH returns size_t.
>
Yes, indeed, sorry, I missed the `+ SSH_MESSAGE_LENGTH(0)` part.
> >> + return -EMSGSIZE;
> >> + }
> >> [...]
> >> + *frame = (struct ssh_frame *)sf.ptr;
> >
> > This also violates strict aliasing.
>
> Sure, the whole point of this function (and ssh_parse_command below) is
> to get pointers to the data structures in the given buffer, i.e. create
> aliases of the right types to the right places in the buffer. Not sure
> how I'd do that differently, apart from copying.
>
> I think that this would even work properly (in practice) without setting
> -fno-strict-aliasing, as all data at this point is effectively
> read-only. Furthermore, after the call access is restricted to the
> non-aliasing parts respectively (frame and payload), whereas the
> underlying buffer they do alias isn't accessed for the rest of that
> scope (i.e. until the other references don't exist any more).
>
> Anyway, as far as I know the kernel is built with -fno-strict-aliasing.
> Please correct me if I'm wrong or relying on aliasing is frowned upon.
>
I don't believe it could cause issue here, I just wanted to note it.
> [...]
> >> +enum ssam_ssh_tc {
> >> + /* Known SSH/EC target categories. */
> >> + // category 0x00 is invalid for EC use
> >> + SSAM_SSH_TC_SAM = 0x01, // generic system functionality, real-time clock
> >> + SSAM_SSH_TC_BAT = 0x02, // battery/power subsystem
> >> + SSAM_SSH_TC_TMP = 0x03, // thermal subsystem
> >> + SSAM_SSH_TC_PMC = 0x04,
> >> + SSAM_SSH_TC_FAN = 0x05,
> >> + SSAM_SSH_TC_PoM = 0x06,
> >> + SSAM_SSH_TC_DBG = 0x07,
> >> + SSAM_SSH_TC_KBD = 0x08, // legacy keyboard (Laptop 1/2)
> >> + SSAM_SSH_TC_FWU = 0x09,
> >> + SSAM_SSH_TC_UNI = 0x0a,
> >> + SSAM_SSH_TC_LPC = 0x0b,
> >> + SSAM_SSH_TC_TCL = 0x0c,
> >> + SSAM_SSH_TC_SFL = 0x0d,
> >> + SSAM_SSH_TC_KIP = 0x0e,
> >> + SSAM_SSH_TC_EXT = 0x0f,
> >> + SSAM_SSH_TC_BLD = 0x10,
> >> + SSAM_SSH_TC_BAS = 0x11, // detachment system (Surface Book 2/3)
> >> + SSAM_SSH_TC_SEN = 0x12,
> >> + SSAM_SSH_TC_SRQ = 0x13,
> >> + SSAM_SSH_TC_MCU = 0x14,
> >> + SSAM_SSH_TC_HID = 0x15, // generic HID input subsystem
> >> + SSAM_SSH_TC_TCH = 0x16,
> >> + SSAM_SSH_TC_BKL = 0x17,
> >> + SSAM_SSH_TC_TAM = 0x18,
> >> + SSAM_SSH_TC_ACC = 0x19,
> >> + SSAM_SSH_TC_UFI = 0x1a,
> >> + SSAM_SSH_TC_USC = 0x1b,
> >> + SSAM_SSH_TC_PEN = 0x1c,
> >> + SSAM_SSH_TC_VID = 0x1d,
> >> + SSAM_SSH_TC_AUD = 0x1e,
> >> + SSAM_SSH_TC_SMC = 0x1f,
> >> + SSAM_SSH_TC_KPD = 0x20,
> >> + SSAM_SSH_TC_REG = 0x21,
> >> +};
> >
> > Is it known what these abbreviations stand for? Maybe I missed them?
>
> The comments state all we really know about these (through observation
> and experimentation). The table itself has been extracted from the
> Windows driver, but the abbreviations and values are all we're getting
> from it.
I see, thanks for the clarification. For some reason, I believed the
"Known SSH/EC target categories" comment means that those are known, as in,
it is known what they are for, etc., not just the abbreviation.
> [...]
Regards,
Barnabás Pőcze