Re: [PATCH net-next v9 2/2] net: ti: icssg_prueth: add TAPRIO offload support

From: Vladimir Oltean
Date: Thu Jun 06 2024 - 10:37:22 EST


On Thu, Jun 06, 2024 at 04:33:58PM +0530, MD Danish Anwar wrote:
> >>>> +static void tas_reset(struct prueth_emac *emac)
> >>>> +{
> >>>> + struct tas_config *tas = &emac->qos.tas.config;
> >>>> + int i;
> >>>> +
> >>>> + for (i = 0; i < TAS_MAX_NUM_QUEUES; i++)
> >>>> + tas->max_sdu_table.max_sdu[i] = 2048;
> >>>
> >>> Macro + short comment for the magic number, please.
> >>>
> >>
> >> Sure I will add it. Each elements in this array is a 2 byte value
> >> showing the maximum length of frame to be allowed through each gate.
> >
> > Is the queueMaxSDU[] array active even with the TAS being in the reset
> > state? Does this configuration have any impact upon the device MTU?
> > I don't know why 2048 was chosen.
>
> I talked to the firmware team. The value of 248 is actually wrong. It
> should be the device mtu only i.e. PRUETH_MAX_MTU.

There was another comment about the value of 0, sent separately.

> > If you're replacing an existing active schedule with a shadow one, the
> > ICSSG_EMAC_PORT_TAS_ENABLE command isn't needed because the TAS is
> > already enabled on the port, right? In fact it will be suppressed by
> > tas_set_state() without even generating an emac_set_port_state() call,
> > right?
> >
>
> As this point TAS is not enabled. TAS is enabled on the port only when
> ICSSG_EMAC_PORT_TAS_ENABLE is sent. Which happens at the end of
> emac_taprio_replace().

"If you're replacing an existing active schedule" => emac_taprio_replace()
was already called once, and we're calling it again, with no emac_taprio_destroy()
in between.

This is done using the "tc qdisc replace" command. You can keep the
mqprio parameters the same, just change the schedule parameters.
The transition from the old to the new schedule is supposed to be
seamless and at a well-defined time, according to the IEEE definitions.

> >> The following three offsets are configured in this function,
> >> 1. TAS_ADMIN_CYCLE_TIME → admin cycle time
> >> 2. TAS_CONFIG_CHANGE_CYCLE_COUNT → number of cycles after which the
> >> admin list is taken as operating list.
> >> This parameter is calculated based on the base_time, cur_time and
> >> cycle_time. If the base_time is in past (already passed) the
> >> TAS_CONFIG_CHANGE_CYCLE_COUNT is set to 1. If the base_time is in
> >> future, TAS_CONFIG_CHANGE_CYCLE_COUNT is calculated using
> >> DIV_ROUND_UP_ULL(base_time - cur_time, cycle_time)
> >> 3. TAS_ADMIN_LIST_LENGTH → Number of window entries in the admin list.
> >>
> >> After configuring the above three parameters, the driver gives the
> >> trigger signal to the firmware using the R30 command interface with
> >> ICSSG_EMAC_PORT_TAS_TRIGGER command.
> >>
> >> The schedule starts based on TAS_CONFIG_CHANGE_CYCLE_COUNT. Those cycles
> >> are relative to time remaining in the base_time from now i.e. base_time
> >> - cur_time.
> >
> > So you're saying that the firmware executes the schedule switch at
> >
> > now + TAS_ADMIN_CYCLE_TIME * TAS_CONFIG_CHANGE_CYCLE_COUNT ns
> > ~~~
> > time of reception of
> > ICSSG_EMAC_PORT_TAS_TRIGGER
> > R30 command
> >
> > ?
> >
>
> I talked to the firmware team on this topic. Seems like this is actually
> a bug in the firmware design. This *now* is very relative and it will
> always introduce jitter as you have mentioned.
>
> The firmware needs to change to handle the below two cases that you have
> mentioned.
>
> The schedule should start at base-time (given by user). Instead of
> sending the cycle count from now to base-time to firmware. Driver should
> send the absolute cycle count corresponding to the base-time. Firmware
> can then check the curr cycle count and when it matches the count set by
> driver firmware will start scheduling.
>
> change_cycle_count = base-time / cycle-time;
>
> This way the irregularity with *now* will be removed. Now even if we run
> the same command on two different ICSSG devices(whose clocks are synced
> with PTP), the scheduling will happen at same time.
>
> As the change_cycle_count will be same for both of them. Since the
> clocks are synced the current cycle count (read from
> TIMESYNC_FW_WC_CYCLECOUNT_OFFSET) will also be same for both the devices

You could pass the actual requested base-time to the firmware, and let
the firmware calculate a cycle count or whatever the hardware needs.
Otherwise, you advance the base-time in the driver into what was the
future at the time, but by the time the r30 command reaches the
firmware, the passed number of cycles has already elapsed.

> > I'm not really interested in how the driver calculates the cycle count,
> > just in what are the primitives that the firmware ABI wants.
> >
> > Does the readb_poll_timeout() call from tas_update_oper_list() actually
> > wait until this whole time elapses? It is user space input, so it can
> > keep a task waiting in the kernel, with rtnl_lock() acquired, for a very
> > long time if the base_time is far away in the future.
> >
>
> readb_poll_timeout() call from tas_update_oper_list() waits for exactly
> 10 msecs. Driver send the trigger_list_change command and sets
> config_change register to 1 (details in tas_set_trigger_list_change()).
> Driver waits for 10 ms for firmware to clear this register. If the
> register is not cleared, list wasn't changed by firmware. Driver will
> then return err.

And the firmware clears this register when? Quickly upon reception of
the TAS_TRIGGER command, or after the TAS is actually triggered (after
change_cycle_count cycles)?

> > 2. You cannot apply a phase offset between the schedules on two ICSSG
> > devices in the same network.
> >
> > Since there is a PHY-dependent propagation delay on each link, network
> > engineers typically delay the schedules on switch ports along the path
> > of a stream.
> >
> > Say for example there is a propagation delay of 800 ns on a switch with
> > base-time 0. On the next switch, you could add the schedule like this:
> >
> > tc qdisc replace dev swp0 parent root taprio \
> > num_tc 8 \
> > map 0 1 2 3 4 5 6 7 \
> > queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
> > base-time 800 \
> > sched-entry S 0x81 100000 \
> > sched-entry S 0x01 900000 \
> > flags 0x2 \
> > max-sdu 0 0 0 0 0 0 0 79
> >
> > Same schedule, phase-shifted by 800 ns, so that if the packet goes
> > through an open gate in the first switch, it will also go through an
> > open gate through the second.
> >
> > According to your own calculations and explanations, the firmware ABI
> > makes no difference between base-time 0 and base-time 800.
> >
>
> In the new implementation base-time 0 and base-time 800 will make a
> difference. as the change_cycle_count will be different from both the cases.
> In case of base-time 0, change_cycle_count will be 1. Implying schedule
> will start on the very next cycle.
>
> In case of base-time 800, change_cycle_count will be 800 / cycle-time.

In this example, cycle-time is (much) larger than 800 ns, so 800 / cycle-time is 0.
Simply put, base-time 0 and base-time 800 will still be treated equally,
if the firmware only starts the schedule upon integer multiples of the
cycle time. A use case is offsetting schedules by a small value, smaller
than the cycle time.

The base-time value of 800 should be advanced by the smallest integer
multiple of the cycle-time that satisfies the inequality
new-base-time = (base-time + N * cycle-time) >= now.

You can see that for the same value of N and cycle-time, new-base-time
will different when base-time = 0 vs when base-time = 800. Taprio
expects that difference to be reflected into the schedule.

> > In this case they are probably both smaller than the current time, so
> > TAS_CONFIG_CHANGE_CYCLE_COUNT will be set to the same "1" in both cases.
> >
>
> If cycle-time is larger then both 0 and 800 then the change_cycle_count
> would be 1 in both the cases.
>
> > But even assuming a future base-time, it still will make no difference.
> > The firmware seems to operate only on integer multiples of a cycle-time
> > (here 1000000).
>
> Yes, the firmware works only on multiple of cycle time. If the base-time
> is not a multiple of cycle time, the scheduling will start on the next
> cycle count.
>
> i.e. change_cycle_count = ceil (base-time / cycle-time)
> > Summarized, the blocking problems I see are:
> >
> > - For issue #2, the driver should not lie to the user space that it
> > applied a schedule with a base-time that isn't a precise multiple of
> > the cycle-time, because it doesn't do that.
> >
>
> Yes, I acknowledge it's a limitation. Driver can print "requested
> base-time is not multiple of cycle-time, secheduling will start on the
> next available cycle from base-time". I agree the driver shouldn't lie
> about this. Whenever driver encounters a base time which is not multiple
> of cycle-time. It can still do the scheduling but throw a print so that
> user is aware of this.

Is that a firmware or a hardware limitation? You're making it sound as
if we shouldn't be expecting for it to be lifted.

> > - For issue #1, the bigger problem is that there is always a
> > software-induced jitter which makes whatever the user space has
> > requested irrelevant.
> >
>
> As a I mentioned earlier, the new implementation will take care of this.
>
> I will work with the firmware team to get this fixed. Once that's done I
> will send a new revision.
>
> Thanks for all the feedbacks. Please let me know if some more
> clarification is needed.

Ok, so we're waiting for a new firmware release, and a check in the
driver that the firmware version >= some minimum requirement, else
-EOPNOTSUPP?