The key here is that we don't want other incoming readers to observe
that there are waiters in the wait queue, and hence be forced into the
slowpath, until the single waiter in the queue has determined that it
will most likely need to go to sleep because there is a writer.
With a constant stream of incoming readers, a major portion of them will
observe a negative count and be serialized as they enter the slowpath.
There are certainly other readers that do not observe the negative count
in the window between one reader clearing the count in the unlock path
and a waiter setting the count negative again. Those readers can go
ahead and do the read in parallel. But it is the serialized readers that
cause the performance loss and the spinlock contention observed in the
perf output.
It is the constant stream of incoming readers that sustains the spinlock
queue and the repeated cycle of clearing the count and setting it
negative again.
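The fastpath/slowpath split described above can be sketched as a small,
single-threaded C11 model. Everything in it is hypothetical: the
model_rwsem structure, the bias values and the helper names are
assumptions for illustration only, and a C11 atomic_flag spinlock stands
in for the rwsem's wait-queue lock. A reader that sees a positive count
stays on the fastpath; one that sees a negative count must take the
spinlock, which is where the serialization comes from.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bias values (assumed, not taken from any kernel version). */
#define ACTIVE_BIAS  1L
#define WAITING_BIAS (-0x10000L)

/* Hypothetical model of an rwsem-style count word plus its wait lock. */
struct model_rwsem {
	atomic_long count;      /* > 0: active readers only; < 0: waiters queued */
	atomic_flag wait_lock;  /* stands in for the wait-queue spinlock */
	long slowpath_entries;  /* readers that had to serialize on wait_lock */
};

#define MODEL_RWSEM_INIT \
	{ .count = 0, .wait_lock = ATOMIC_FLAG_INIT, .slowpath_entries = 0 }

static void spin_lock(atomic_flag *lock)
{
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;	/* busy-wait: readers reaching here are serialized */
}

static void spin_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/* In this model, a waiter marks the queue non-empty by making count negative. */
static void model_set_waiters(struct model_rwsem *sem)
{
	atomic_fetch_add(&sem->count, WAITING_BIAS);
}

/* In this model, the unlock path removes the waiting bias again. */
static void model_clear_waiters(struct model_rwsem *sem)
{
	atomic_fetch_sub(&sem->count, WAITING_BIAS);
}

/* Returns true if the reader stayed on the fastpath. */
static bool model_down_read(struct model_rwsem *sem)
{
	long c = atomic_fetch_add(&sem->count, ACTIVE_BIAS) + ACTIVE_BIAS;

	if (c > 0)
		return true;	/* no waiters observed: read proceeds in parallel */

	/* Negative count observed: take the serializing slowpath. */
	spin_lock(&sem->wait_lock);
	atomic_fetch_sub(&sem->count, ACTIVE_BIAS);	/* back out our bias */
	sem->slowpath_entries++;
	/* A real rwsem would queue the reader here and possibly sleep. */
	spin_unlock(&sem->wait_lock);
	return false;
}
```

In this sketch, a call to model_set_waiters() before model_down_read()
pushes the reader onto the slowpath, while model_clear_waiters() reopens
the window in which readers proceed in parallel. With a steady stream of
readers, each such cycle sends another batch of readers through the
spinlock.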