Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
From: Oleg Nesterov
Date: Mon Jun 05 2023 - 10:21:55 EST
On 06/02, Linus Torvalds wrote:
>
> On Fri, Jun 2, 2023 at 1:59 PM Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> >
> > As I said from the very beginning, this code is fine on x86 because
> > atomic ops are fully serialised on x86.
>
> Yes. Other architectures require __smp_mb__{before,after}_atomic for
> the bit setting ops to actually be memory barriers.
>
> We *should* probably have acquire/release versions of the bit test/set
> helpers, but we don't, so they end up being full memory barriers with
> those things. Which isn't optimal, but I doubt it matters on most
> architectures.
>
> So maybe we'll some day have a "test_bit_acquire()" and a
> "set_bit_release()" etc.
In this particular case we need clear_bit_release() and iiuc it is
already here, just it is named clear_bit_unlock().
So do you agree that vhost_worker() needs smp_mb__before_atomic()
before clear_bit() or just clear_bit_unlock() to avoid the race with
vhost_work_queue() ?
Let me provide a simplified example:
	struct item {
		struct llist_node	llist;
		unsigned long		flags;
	};
	struct llist_head HEAD = {};	// global
	void queue(struct item *item)
	{
		// ensure this item was already flushed
		if (!test_and_set_bit(0, &item->flags))
			llist_add(&item->llist, &HEAD);
	}
	void flush(void)
	{
		struct llist_node *head = llist_del_all(&HEAD);
		struct item *item, *next;
		llist_for_each_entry_safe(item, next, head, llist)
			clear_bit(0, &item->flags);
	}
I think this code is buggy in that flush() can race with queue(), the same
way as vhost_worker() and vhost_work_queue().
Once flush() clears bit 0, queue() can run on another CPU, re-queue
this item and change item->llist.next. We need a barrier before clear_bit()
to ensure that the load of item->llist.next (which llist_for_each_entry_safe()
does to compute "next") completes before the result of clear_bit() is
visible to queue().
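To make the required ordering explicit, here is a userspace sketch of the same pattern. It is not kernel code and not a proposed patch: it assumes C11 <stdatomic.h> in place of the kernel bitops and llist, and the names queue_item() and flush_all() (and their return values) are made up for illustration. The release on the clear is meant to play the role of clear_bit_unlock(); the acquire on the fetch-or is weaker than what test_and_set_bit() provides, but it should be enough to pair with that release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct item {
	struct item *next;	/* plain field, protected by bit 0 of flags */
	atomic_ulong flags;
};

static _Atomic(struct item *) head;

/* Returns 1 if the item was queued, 0 if it was already pending. */
static int queue_item(struct item *it)
{
	struct item *old;

	assert(it != NULL);
	/* acquire: pairs with the release in flush_all() */
	if (atomic_fetch_or_explicit(&it->flags, 1UL, memory_order_acquire) & 1UL)
		return 0;

	old = atomic_load_explicit(&head, memory_order_relaxed);
	do {
		it->next = old;	/* this is the store that flush must not race with */
	} while (!atomic_compare_exchange_weak_explicit(&head, &old, it,
			memory_order_release, memory_order_relaxed));
	return 1;
}

/* Returns the number of items flushed. */
static int flush_all(void)
{
	struct item *it = atomic_exchange_explicit(&head, (struct item *)NULL,
						   memory_order_acquire);
	int n = 0;

	while (it) {
		struct item *next = it->next;	/* load before the clear below */
		/* release: the load of it->next cannot be reordered after this */
		atomic_fetch_and_explicit(&it->flags, ~1UL, memory_order_release);
		it = next;
		n++;
	}
	return n;
}
```

With memory_order_relaxed on the fetch-and instead, nothing would order the load of it->next against the clear, which is the problem described above.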
And, I do not think we can rely on control dependency because... because
I fail to see the load-store control dependency in this code,
llist_for_each_entry_safe() loads item->llist.next but doesn't check the
result until the next iteration.
No?
Oleg.