Instead, how about just removing the release from atomic_long_sub_return_release()
such that the osq load is not hoisted over the atomic compound (along with Peter's
comment):
This solution will actually enforce a stronger (full) ordering w.r.t. the
solution described by Prateek and Peter. Also, it will "trade" two lwsync
for two sync (powerpc), one dmb.ld for one dmb (arm64).
What are the reasons you would prefer this?