Re: [PATCH 1/3] Add TX sending hardware timestamp.

From: Geva, Erez
Date: Fri Dec 11 2020 - 09:23:59 EST



On 11/12/2020 00:30, Willem de Bruijn wrote:
>>> If I understand correctly, you are trying to achieve a single delivery time.
>>> The need for two separate timestamps passed along is only because the
>>> kernel is unable to do the time base conversion.
>>
>> Yes, a correct point.
>>
>>>
>>> Else, ETF could program the qdisc watchdog in system time and later,
>>> on dequeue, convert skb->tstamp to the h/w time base before
>>> passing it to the device.
>>
>> Or the skb->tstamp is HW time-stamp and the ETF convert it to system clock based.
>>
>>>
>>> It's still not entirely clear to me why the packet has to be held by
>>> ETF initially first, if it is held until delivery time by hardware
>>> later. But more on that below.
>>
>> Let plot a simple scenario.
>> App A send a packet with time-stamp 100.
>> After arrive a second packet from App B with time-stamp 90.
>> Without ETF, the second packet will have to wait till the interface hardware send the first packet on 100.
>> Making the second packet late by 10 + first packet send time.
>> Obviously other "normal" packets are send to the non-ETF queue, though they do not block ETF packets
>> The ETF delta is a barrier that the application have to send the packet before to ensure the packet do not tossed.
>
> Got it. The assumption here is that devices are FIFO. That is not
> necessarily the case, but I do not know whether it is in practice,
> e.g., on the i210.
>
>>
>>>
>>> So far, the use case sounds a bit narrow and the use of two timestamp
>>> fields for a single delivery event a bit of a hack.
>>
>> The definition of a hack is up to you
>
> Fair enough :) That wasn't very constructive feedback on my part.
>
>>> And one that does impose a cost in the hot path of many workloads
>>> by adding a field the ip cookie, cork and writing to (possibly cold)
>>> skb_shinfo for every packet.
>>
>> Most packets do not use skb->tstamp either, probably the cost of testing is higher then just copying.
>> But perhaps if we copy 2 time-stamp we can add a condition for both.
>> What do you think?
>
> I'd need to take a closer look at the skb_hwtstamps, which unlike
> skb->tstamp lie in the skb_shared_data. If that is an otherwise cold
> cacheline, then access would be expensive.
Good point.
We should review it.
That can make a flag for using copying time-stamps into the SKB more feasible.

>
> The ipcm and cork are admittedly cheap and not worth a branch. But
> still it is good to understand that this situation of unsynchronized
> clocks is a common operation condition for the foreseeable future, not
> an unfortunate constraint of a single piece of hardware.
>
> An extreme option would be moving everything behind a static_branch as
> most hot paths will not have the feature enabled. But I'm not
> seriously suggesting that for a few assignments.
>
>> The cookie and the cork are just intermediate from application to SKB, I do not think they cost much.
>> Both writes of time stamp to the cookie and the cork are conditioned.
>>
>>>
>>>>>>> Indeed, we want pacing offload to work for existing applications.
>>>>>>>
>>>>>> As the conversion of the PHC and the system clock is dynamic over time.
>>>>>> How do you propse to achive it?
>>>>>
>>>>> Can you elaborate on this concern?
>>>>
>>>> Using single time stamp have 3 possible solutions:
>>>>
>>>> 1. Current solution, synchronize the system clock and the PHC.
>>>> Application uses the system clock.
>>>> The ETF can use the system clock for ordering and pass the packet to the driver on time
>>>> The network interface hardware compare the time-stamp to the PHC.
>>>>
>>>> 2. The application convert the PHC time-stamp to system clock based.
>>>> The ETF works as solution 1
>>>> The network driver convert the system clock time-stamp back to PHC time-stamp.
>>>> This solution need a new Net-Link flag and modify the relevant network drivers.
>>>> Yet this solution have 2 problems:
>>>> * As applications today are not aware that system clock and PHC are not synchronized and
>>>> therefore do not perform any conversion, most of them only use the system clock.
>>>> * As the conversion in the network driver happens ~300 - 600 microseconds after
>>>> the application send the packet.
>>>> And as the PHC and system clock frequencies and offset can change during this period.
>>>> The conversion will produce a different PHC time-stamp from the application original time-stamp.
>>>> We require a precession of 1 nanoseconds of the PHC time-stamp.
>>>>
>>>> 3. The application uses PHC time-stamp for skb->tstamp
>>>> The ETF convert the PHC time-stamp to system clock time-stamp.
>>>> This solution require implementations on supporting reading PHC clocks
>>>> from IRQ/kernel thread context in kernel space.
>>>
>>> ETF has to release the packet well in advance of the hardware
>>> timestamp for the packet to arrive at the device on time. In practice
>>> I would expect this delta parameter to be at least at usec timescale.
>>> That gives some wiggle room with regard to s/w tstamp, at least.
>>
>> Yes, the author of the ETF uses a delta of 300 usec.
>> The interface I use for testing, Intel I210 need around 100 usec to 150 usec.
>> I believe it is related to PCIe speed of transferring the data on time and probably similar to other interfaces using PCIe.
>> If you overflow the interface hardware with high traffic the delta is much higher.
>> The clocks conversion error of the application is characteristic around ~1 usec to 5 usec for up to 10 ms sending a head.
>>
>>>
>>> If changes in clock distance are relatively infrequent, could this
>>> clock diff be a qdisc parameter, updated infrequently outside the
>>> packet path?
>>
>> As the clocks are updating of both frequency and offset dynamically make it very hard to perform.
>> The rate of the update of the PHC depends on PTP setting (usually around 1 second).
>> The rate of the update of the system clock depends how you synchronize it ( I assume it is similar or slower).
>> But user may require and use higher rates. It is only penalty by more traffic and CPU.
>> Bare in mind the 2 clocks synchronization are independent, the cross can be unpredictable.
>>
>> The ETF would have to "know" on which packets we use the previous update and which are the last update.
>> And hope we do not "miss" updates.
>>
>> And we would need a "service" to update these values with proper configuration, sound like overhead to me.
>
> Ack. Thanks for the operating context. I didn't know these constraints
> well enough. Agreed that this is not a very feasible approach then.
>
>> Another point.
>> The delta includes the PCIe DMA transfer speed, this is a hardware limitation.
>> The idea of TSN is that the application send the packet as closer as possible to the hardware send.
>> Increase the error of the clocks conversion defy the purpose of TSN.
>>
>> A more reasonable is to track the clocks inside the kernel.
>> As we mention on solution 3.
>>
>>>
>>> It would even be preferable if the qdisc and core stack could be
>>> ignorant of such hardware clocks and the time base is converted by the
>>> device driver when encoding skb->tstamp into the tx descriptor. Is the
>>> device hardware clock readable by the driver?
>>
>> All drivers that support PTP (IEEE 1558) have to read the PHC.
>> PTP is mandatory for TSN.
>> But some drivers might be limited on which context they can read the PHC.
>> This is a question to the vendors.
>> For example Intel I210 allow reading the PHC.
>>
>> However the kernel POSIX layer uses application context lockings.
>>
>>>
>>> From the above, it sounds like this is not trivial.
>>
>> I am not sure if it so complicated.
>> But as the Linux maintainers want to keep the Linux kernel with a single system clock.
>> It might be more of a political question, or perhaps a better planning then I did.
>
> This would seem the preferable option to me: use a kernel time base
> throughout the stack and limit knowledge of the hardware clock to the
> relevant hardware driver.
>
> If that is infeasible, then I don't immediately see an alternative to
> the current dual timestamp field variant, either.
>
>>>
>>> I don't know which exact device you're targeting. Is it an in-tree driver?
>>
>> ETF uses ethernet interfaces with IEEE 1558 and 802.1Qbv or 802.1Qbu.
>> Interfaces that support TSN, https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FTime-Sensitive_Networking&data=04%7C01%7Cerez.geva.ext%40siemens.com%7C51af403dfc0041ced22e08d89d63db25%7C38ae3bcd95794fd4addab42e1495d55a%7C1%7C0%7C637432399491933753%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=9xEP%2F2w3hrGItE0%2Btl0Bsy1xtUXwv4xUowuZ9QInXmI%3D&reserved=0
>> I use Intel I210 at the moment.
>> As of 5.10-rc6, there are 4 drivers
>> kernel-etf/drivers/net/ethernet (etf-5.10-rc6)$ find -name '*.c' | xargs grep -r TC_SETUP_QDISC_ETF
>> ./freescale/enetc/enetc.c: case TC_SETUP_QDISC_ETF:
>> ./stmicro/stmmac/stmmac_main.c: case TC_SETUP_QDISC_ETF:
>> ./intel/igc/igc_main.c: case TC_SETUP_QDISC_ETF:
>> ./intel/igb/igb_main.c: case TC_SETUP_QDISC_ETF:
>> There are more vendors like
>> Renesas that have a driver for the RZ-G SOC.
>> Broadcom have chips that support TSN, but they do not publish the code.
>> I believe that other vendors will add TSN support as it becomes more popular.
>
> Very clear. Thanks.
>
>>>
>>>> Just for clarification:
>>>> ETF as all Net-Link, only uses system clock (the TAI)
>>>> The network interface hardware only uses the PHC.
>>>> Nor Net-Link neither the driver perform any conversions.
>>>> The Kernel does not provide and clock conversion beside system clock.
>>>> Linux kernel is a single clock system.
>>>>
>>>>>
>>>>> The simplest solution for offloading pacing would be to interpret
>>>>> skb->tstamp either for software pacing, or skip software pacing if the
>>>>> device advertises a NETIF_F hardware pacing feature.
>>>>
>>>> That will defy the purpose of ETF.
>>>> ETF exist for ordering packets.
>>>> Why should the device driver defer it?
>>>> Simply do not use the QDISC for this interface.
>>>
>>> ETF queues packets until their delivery time is reached. It does not
>>> order for any other reason than to calculate the next qdisc watchdog
>>> event, really.
>>
>> No, ETF also order the packets on .enqueue = etf_enqueue_timesortedlist()
>> Our patch suggest to order them by hardware time stamp.
>> And leave the watchdog setting using skb->tstamp that hold system clock TAI time-stamp.
>>
>>>
>>> If h/w can do the same and the driver can convert skb->tstamp to the
>>> right timebase, indeed no qdisc is needed for pacing. But there may be
>>> a need for selective h/w offload if h/w has additional constraints,
>>> such as on the number of packets outstanding or time horizon.
>>
>> The driver do not order the packets, it send packets in the order of arrival.
>> We can add ETF component to each relevant driver, but do we want to add Net-Link features to drivers?
>> I think the purpose is to make the drivers as small as possible and leave common intelligence in the Net-Link layer.
>
> I was thinking of devices that implement ETF in hardware for full
> pacing hardware offload. Not in the driver.
>
>>>
>>>>>
>>>>> Clockbase is an issue. The device driver may have to convert to
>>>>> whatever format the device expects when copying skb->tstamp in the
>>>>> device tx descriptor.
>>>>
>>>> We do hope our definition is clear.
>>>> In the current kernel skb->tstamp uses system clock.
>>>> The hardware time-stamp is PHC based, as it is used today for PTP two steps.
>>>> We only propose to use the same hardware time-stamp.
>>>>
>>>> Passing the hardware time-stamp to the skb->tstamp might seems a bit tricky
>>>> The gaol is the leave the driver unaware to whether we
>>>> * Synchronizing the PHC and system clock
>>>> * The ETF pass the hardware time-stamp to skb->tstamp
>>>> Only the applications and the ETF are aware.
>>>> The application can detect by checking the ETF flag.
>>>> The ETF flags are part of the network administration.
>>>> That also configure the PTP and the system clock synchronization.
>>>>
>>>>>
>>>>>>
>>>>>>> It only requires that pacing qdiscs, both sch_etf and sch_fq,
>>>>>>> optionally skip queuing in their .enqueue callback and instead allow
>>>>>>> the skb to pass to the device driver as is, with skb->tstamp set. Only
>>>>>>> to devices that advertise support for h/w pacing offload.
>>>>>>>
>>>>>> I did not use "Fair Queue traffic policing".
>>>>>> As for ETF, it is all about ordering packets from different applications.
>>>>>> How can we achive it with skiping queuing?
>>>>>> Could you elaborate on this point?
>>>>>
>>>>> The qdisc can only defer pacing to hardware if hardware can ensure the
>>>>> same invariants on ordering, of course.
>>>>
>>>> Yes, this is why we suggest ETF order packets using the hardware time-stamp.
>>>> And pass the packet based on system time.
>>>> So ETF query the system clock only and not the PHC.
>>>
>>> On which note: with this patch set all applications have to agree to
>>> use h/w time base in etf_enqueue_timesortedlist. In practice that
>>> makes this h/w mode a qdisc used by a single process?
>>
>> A single process theoretically does not need ETF, just set the skb-> tstamp and use a pass through queue.
>> However the only way now to set TC_SETUP_QDISC_ETF in the driver is using ETF.
>
> Yes, and I'd like to eventually get rid of this constraint.
>
>
>> The ETF QDISC is per network interface.
>> So all application that uses a single network interface have to comply to the QDISC configuration.
>> Sound like any other new feature in the NetLink.
>>
>> Theoretically a single network interface could participate in 2 TSN/PTP domains.
>> In that case you can create one QDISC without "use hardware time-stamp" for first TSN/PTP domain and synchronize the PHC to system clock.
>> And use the second one with a QDISC that use hardware time-stamp.
>> You will need a driver/hardware that support multiple PHCs.
>> The separation of the domains could be using VLANs.
>>
>> Note: A TSN domain is bound to a PTP domain.
>>
>>>
>>>>>
>>>>> Btw: this is quite a long list of CC:s
>>>>>
>>>> I need to update my company colleagues as well as the Linux group.
>>>
>>> Of course. But even ignoring that this is still quite a large list (> 40).
>>>
>>> My response yesterday was actually blocked as a result ;) Retrying now.
>>>
>>
>> I left 5 people from Siemens, I hope it improves.
>>
>>
>> Thanks for your comments and enlightenments, they are very useful
>> Erez