But it was not me who claimed that 'workqueues are slow'.
choice. I am just wondering out loud whether this particular tool, in its current usage pattern, makes much technological sense. My claim is: it could very well be that it doesn't make _much_ sense, and in that case we should provide a non-intrusive migration path away from it, in the form of a compatible API wrapper around a saner mechanism (albeit a slower one, by virtue of having to emulate an existing API). The examples cited so far had the tasklet as an intermediary towards a softirq - what's the technological point of such a split?
The most scalable workloads don't involve any (or many) softirq middlemen at all: you queue work straight from the hardirq context to the target process context. And that's what you want to do _anyway_, because you want to create as little locally cached data for the hardirq context as possible, since the target task could easily be on another CPU. (This is generally true for things like block IO, but it's also true for things like network IO.)
The most scalable solution would be _for the network adapter to figure out the target CPU for the packet_.
Not many (if any) such adapters exist at the moment. (as it would involve allocating NR_CPUs irqs to that adapter alone.)
Tasklet is single thread by definition and purpose. Those a few places where people used tasklets to do per-cpu jobs (RCU f.e.) exist just because they had troubles with allocating new softirq. [...]
No. The following tale is the true and only history of the RCU tasklet ;-) The RCU guys first used a tasklet, then noticed its bad scalability (a particular VFS-intense benchmark regressed because only a single CPU would do RCU completion on an 8-way box), so they switched it to a per-cpu tasklet - without realizing that a per-cpu tasklet is in essence a softirq. I pointed it out to them (years down the road ...) and then the "convert rcu-tasklet to softirq" patch was born.
outlined above: if you want good scalability, don't use middlemen :-) Figure out the target task as early as possible and let it do as much of the remaining work as possible. _Increasing_ the amount of cached context (by doing delayed processing in tasklets or even softirqs on the same CPU where the hardirq arrived) only increases the cross-CPU cost. Keeping stuff in a softirq only makes (some) sense as long as you have no target task at all (routing, filtering, etc.).
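
To make the "no middlemen" shape concrete, here is a toy user-space model in C. It is only a sketch and not kernel code: struct task_queue, hardirq_dispatch() and task_drain() are hypothetical names invented for this illustration, there is no locking or real interrupt context, and the queue depth is arbitrary. The point it tries to show is that the handler standing in for the hardirq only picks the target and enqueues, while every bit of actual processing happens on the target task's side.

```c
#include <stddef.h>

#define QUEUE_CAP 8	/* hypothetical per-task queue depth */

/* One queue per target task (hypothetical structure, not a kernel type). */
struct task_queue {
	int items[QUEUE_CAP];
	size_t head;	/* next slot to consume */
	size_t tail;	/* next slot to fill */
};

/*
 * Stand-in for the hardirq handler: identify the target task and hand
 * the payload over without touching it any further.
 * Returns 0 on success, -1 if the target is invalid or its queue is full.
 */
static int hardirq_dispatch(struct task_queue *queues, size_t nr_tasks,
			    size_t target, int payload)
{
	struct task_queue *q;

	if (target >= nr_tasks)
		return -1;

	q = &queues[target];
	if (q->tail - q->head == QUEUE_CAP)
		return -1;

	q->items[q->tail % QUEUE_CAP] = payload;
	q->tail++;
	return 0;
}

/*
 * Stand-in for the target task: all real processing happens here, on
 * whatever CPU the task happens to run on.
 * Returns the number of items handled and adds their payloads to *sum.
 */
static size_t task_drain(struct task_queue *q, long *sum)
{
	size_t handled = 0;

	while (q->head != q->tail) {
		*sum += q->items[q->head % QUEUE_CAP];
		q->head++;
		handled++;
	}
	return handled;
}
```

In this model there is no intermediate stage that pulls the payload into the dispatching side's cache before the target sees it; adding a tasklet-like step would mean one more pass over the same data on the CPU that took the interrupt.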