Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly

From: Linus Torvalds
Date: Fri Apr 24 2020 - 14:02:57 EST


On Thu, Apr 23, 2020 at 8:36 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>
> At one point my brain I had forgetten that xchg can not take two memory
> arguments and had hoped to be able to provide stronger guarnatees than I
> can. Which is where I think the structure of exchange_pids came from.

Note that even if we were to have a "exchange two memory locations
atomically" instruction (and we don't - even a "double cmpxchg" is
actually just a double-_sized_ one, not a two different locations
one), I'm not convinced it makes sense.

There's no way to _walk_ two lists atomically. Any user will only ever
walk one or the other, so it's not sensible to try to make the two
list updates be atomic.

And if a user for some reason walks both, the walking itself will
obviously then be racy - it does one or the other first, and can see
either the old state, or the new state - or see _neither_ (ie if you
walk it twice, you might see neither task, or you might see both, just
depending on order or walk).

> I do agree the clearer we can write things, the easier it is for
> someone else to come along and follow.

Your alternate write of the function seems a bit more readable to me,
even if the main effect might be just that it was split up a bit and
added a few comments and whitespace.

So I'm more happier with that one. That said:

> We can not use a remove and reinser model because that does break rcu
> accesses, and complicates everything else. With a swap model we have
> the struct pids pointer at either of the tasks that are swapped but
> never at nothing.

I'm not suggesting removing the pid entirely - like making task->pid
be NULL. I'm literally suggesting just doing the RCU list operations
as "remove and re-insert".

And that shouldn't break anything, for the same reason that an atomic
exchange doesn't make sense: you can only ever walk one of the lists
at a time. And regardless of how you walk it, you might not see the
new state (or the old state) reliably.

Put another way:

> void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right)
> {
> struct hlist_node **lpprev = left->pprev;
> struct hlist_node **rpprev = right->pprev;
>
> rcu_assign_pointer(*lpprev, right);
> rcu_assign_pointer(*rpprev, left);

These are the only two assignments that matter for anything that walks
the list (the pprev ones are for things that change the list, and they
have to have exclusions in place).

And those two writes cannot be atomic anyway, so you fundamentally
will always be in the situation that a walker can miss one of the
tasks.

Which is why I think it would be ok to just do the RCU list swap as a
"remove left, remove right, add left, add right" operation. It doesn't
seem fundamentally different to a walker than the "switch left/right"
operation, and it seems much simpler.

Is there something I'm missing?

But I'm *not* suggesting that we change these simple parts to be
"remove thread_pid or pid pointer, and then insert a new one":

> /* Swap thread_pid */
> rpid = left->thread_pid;
> lpid = right->thread_pid;
> rcu_assign_pointer(left->thread_pid, lpid);
> rcu_assign_pointer(right->thread_pid, rpid);
>
> /* Swap the cached pid value */
> WRITE_ONCE(left->pid, pid_nr(lpid));
> WRITE_ONCE(right->pid, pid_nr(rpid));
> }

because I agree that for things that don't _walk_ the list, but just
look up "thread_pid" vs "pid" atomically but asynchronously, we
obviously need to get one or the other, not some kind of "empty"
state.

> Does that look a little more readable?

Regardless, I find your new version at least a lot more readable, so
I'm ok with it.

It looks like Oleg found an independent issue, though.

Linus