Re: [net/sched] Question: Locks for clearing ERR_PTR() value from idrinfo->action_idr ?

From: Pedro Tammela
Date: Fri Jun 14 2024 - 08:13:38 EST


On 14/06/2024 01:00, Tetsuo Handa wrote:
On 2024/06/14 11:47, Pedro Tammela wrote:
On 13/06/2024 21:58, Tetsuo Handa wrote:

Is there a possibility that tcf_idr_check_alloc() is called without holding
rtnl_mutex?

There is, but not in the code path of this reproducer.

If yes, adding a sleep before "goto again;" would help. But if no,
is this a sign that some path forgot to call tcf_idr_{cleanup,insert_many}() ?

The reproducer is sending a new action message with 2 actions.
Actions are committed to the idr after processing in order to make them visible
together and after any errors are caught.

The bug happens when the actions in the message refer to the same index. Since
the first processing succeeds, adding -EBUSY to the index, the second processing,
which references the same index, will loop forever.

After the change to rely on RCU for this check, instead of the idr lock, the hangs
became more noticeable to syzbot since now it's hanging a system-wide lock.

Thank you for explanation. Then, what type of sleep do we want?

This patch should fix the bug: https://lore.kernel.org/netdev/20240613071021.471432-1-druth@xxxxxxxxxxxx/