I thought you were concerned about cpu 0 doing a gup_fast(), cpu 1 doing P->N, and cpu 2 doing N->P. In this case cpu 2 is waiting on the pte lock.
The issue is that if cpu 0 is doing a gup_fast() while other cpus are doing P->P updates, gup_fast() can get a mix of old and new pte values, where P->P is any aggregate set of unsynchronized P->N and N->P operations on any number of other cpus. Ah, but if every P->N is followed by a tlb flush, then disabling interrupts will hold off any following N->P, allowing gup_fast() to get a consistent pte snapshot.
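To make the mixed-value case concrete, here is a rough single-threaded model of it. This is not kernel code, just a sketch: struct split_pte, read_pte_unlocked(), remap_p_to_p() and pte_is_torn() are names made up for illustration, and it assumes the pte is read as two separate 32-bit halves with the present bit in bit 0 of the low half. The "other cpu" is simulated by a hook that runs between the two half-reads.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical split pte: two halves that cannot be loaded atomically. */
struct split_pte {
    uint32_t low;   /* assumed to carry the present bit in bit 0 */
    uint32_t high;
};

#define PTE_PRESENT 0x1u

/* Hook run between the two half-reads; stands in for another cpu. */
typedef void (*interleave_fn)(struct split_pte *p);

/* Lockless read: low half, then high half, with no exclusion in between. */
static struct split_pte read_pte_unlocked(struct split_pte *p,
                                          interleave_fn other_cpu)
{
    struct split_pte snap;

    snap.low = p->low;
    if (other_cpu)
        other_cpu(p);           /* an update lands mid-read */
    snap.high = p->high;
    return snap;
}

/* P->P as an aggregate: a P->N followed directly by an N->P to a new page. */
static void remap_p_to_p(struct split_pte *p)
{
    p->low = 0;                         /* P->N: clear present */
    p->high = 0;
    p->high = 0x2;                      /* N->P: new high half */
    p->low = 0x2000u | PTE_PRESENT;     /* N->P: new low half, present */
}

/* A snapshot is torn if it matches neither the old nor the new pte. */
static int pte_is_torn(struct split_pte snap, struct split_pte old_pte,
                       struct split_pte new_pte)
{
    int is_old = snap.low == old_pte.low && snap.high == old_pte.high;
    int is_new = snap.low == new_pte.low && snap.high == new_pte.high;

    return !is_old && !is_new;
}
```

Calling read_pte_unlocked() with remap_p_to_p as the hook gives a snapshot with the old low half and the new high half, which pte_is_torn() reports as torn; with a NULL hook the snapshot is just the old pte. Nothing in the model holds off the N->P half of the update, which is exactly what the tlb flush plus disabled interrupts is supposed to do in the real case.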
Hm, awkward if flush_tlb_others doesn't IPI...
Simplest fix is to make gup_get_pte() a pvop, but that does seem like putting a red flag in front of an inner-loop hotspot, or something...
The per-cpu tlb-flush exclusion flag might really be the way to go.