Re: [PATCH 15/24] thunderbolt: Rework control channel to be more reliable
From: Greg Kroah-Hartman
Date: Thu May 25 2017 - 09:26:01 EST
On Thu, May 18, 2017 at 05:39:05PM +0300, Mika Westerberg wrote:
> If a request times out the response might arrive right after the request
> is failed. This response is pushed to the kfifo and next request will
> read it instead. Since it most likely will not pass our validation
> checks in parse_header() the next request will fail as well, and
> response to that request will be pushed to the kfifo, ad infinitum.
>
> We end up in a situation where all requests fail and no devices can be
> added anymore until the driver is unloaded and reloaded again.
>
> To overcome this, rework the control channel so that we will have a
> queue of outstanding requests. Each request will be handled in turn and
> the response is validated against what is expected. Unexpected packets
> (for example responses for requests that have been timed out) are
> dropped. This model is copied from Greybus implementation with small
> changes here and there to get it cope with Thunderbolt control packets.
>
> In addition the configuration packets support sequence number which the
> switch is supposed to copy from the request to response. We use this to
> drop responses that are already timed out. Taking advantage of the
> sequence number, we automatically retry configuration read/write 4 times
> before giving up.
>
> Also timeout is not a programming error so there is no need to trigger a
> scary backtrace (WARN), instead we just log a warning. After all
> Thunderbolt devices are hot-pluggable by definition which means user can
> unplug a device any time and that is totally acceptable.
>
> With this change there is no need to take the global domain lock when
> sending configuration packets anymore. This is useful when we add
> support for cross-domain (XDomain) communication later on.
>
> Signed-off-by: Mika Westerberg <mika.westerberg@xxxxxxxxxxxxxxx>
> Reviewed-by: Yehezkel Bernat <yehezkel.bernat@xxxxxxxxx>
> Reviewed-by: Michael Jamet <michael.jamet@xxxxxxxxx>
> ---
> drivers/thunderbolt/ctl.c | 471 +++++++++++++++++++++++++++++++++++++++-------
> drivers/thunderbolt/ctl.h | 65 +++++++
> drivers/thunderbolt/tb.h | 2 +-
> 3 files changed, 467 insertions(+), 71 deletions(-)
>
> diff --git a/drivers/thunderbolt/ctl.c b/drivers/thunderbolt/ctl.c
> index b3ee755fdb37..2b1255cbf3c9 100644
> --- a/drivers/thunderbolt/ctl.c
> +++ b/drivers/thunderbolt/ctl.c
> @@ -5,22 +5,17 @@
> */
>
> #include <linux/crc32.h>
> +#include <linux/delay.h>
> #include <linux/slab.h>
> #include <linux/pci.h>
> #include <linux/dmapool.h>
> #include <linux/workqueue.h>
> -#include <linux/kfifo.h>
>
> #include "ctl.h"
>
>
> -struct ctl_pkg {
> - struct tb_ctl *ctl;
> - void *buffer;
> - struct ring_frame frame;
> -};
> -
> -#define TB_CTL_RX_PKG_COUNT 10
> +#define TB_CTL_RX_PKG_COUNT 10
> +#define TB_CTL_RETRIES 4
>
> /**
> * struct tb_cfg - thunderbolt control channel
> @@ -32,8 +27,9 @@ struct tb_ctl {
>
> struct dma_pool *frame_pool;
> struct ctl_pkg *rx_packets[TB_CTL_RX_PKG_COUNT];
> - DECLARE_KFIFO(response_fifo, struct ctl_pkg*, 16);
> - struct completion response_ready;
> + struct mutex request_lock;
> + struct list_head request_queue;
> + bool running;
>
> event_cb callback;
> void *callback_data;
> @@ -55,10 +51,115 @@ struct tb_ctl {
> #define tb_ctl_dbg(ctl, format, arg...) \
> dev_dbg(&(ctl)->nhi->pdev->dev, format, ## arg)
>
> +static DECLARE_WAIT_QUEUE_HEAD(tb_cfg_request_cancel_queue);
> +
> +/**
> + * tb_cfg_request_alloc() - Allocates a new config request
> + *
> + * This is refcounted object so when you are done with this, call
> + * tb_cfg_request_put() to it.
> + */
> +struct tb_cfg_request *tb_cfg_request_alloc(void)
> +{
> + struct tb_cfg_request *req;
> +
> + req = kzalloc(sizeof(*req), GFP_KERNEL);
> + if (!req)
> + return NULL;
> +
> + kref_init(&req->kref);
> +
> + return req;
> +}
> +
> +/**
> + * tb_cfg_request_get() - Increase refcount of a request
> + * @req: Request whose refcount is increased
> + */
> +void tb_cfg_request_get(struct tb_cfg_request *req)
> +{
> + kref_get(&req->kref);
> +}
> +
> +static void tb_cfg_request_destroy(struct kref *kref)
> +{
> + struct tb_cfg_request *req = container_of(kref, typeof(*req), kref);
> +
> + kfree(req);
> +}
> +
> +/**
> + * tb_cfg_request_put() - Decrease refcount and possibly release the request
> + * @req: Request whose refcount is decreased
> + *
> + * Call this function when you are done with the request. When refcount
> + * goes to %0 the object is released.
> + */
> +void tb_cfg_request_put(struct tb_cfg_request *req)
> +{
> + kref_put(&req->kref, tb_cfg_request_destroy);
> +}
What prevents this call from being called twice on the same object from
different threads at the same time? You still need a lock somewhere to
protect yourself from that, am I just missing where that lock is?
thanks,
greg k-h