Re: Kernel Concurrency Sanitizer (KCSAN)

From: Marco Elver
Date: Thu Dec 12 2019 - 15:53:44 EST


On Thu, 12 Dec 2019 at 10:57, Walter <truhuan@xxxxxxxxx> wrote:
>
> Hi Marco,
>
> Data races are a constant problem for us, and we are happy to use this debug
> tool to find their root causes. To do so, we need to understand how the tool
> is implemented. We traced your code and have some questions; could you take
> some time to answer them?
> Thanks.
>
> Question:
> We assume both CPUs access the same variable through read() and write().
> Are the two scenarios below false negatives?
>
> ===
> Scenario 1:
>
> CPU 0:                                                        CPU 1:
> tsan_read()                                                   tsan_write()
> check_access()                                                check_access()
> watchpoint=find_watchpoint() // watchpoint=NULL               watchpoint=find_watchpoint() // watchpoint=NULL
> kcsan_setup_watchpoint()                                      kcsan_setup_watchpoint()
> watchpoint = insert_watchpoint                                watchpoint = insert_watchpoint

This assumes there is more than 1 free slot for the address; otherwise
it is impossible for both to set up a watchpoint.

> if (!remove_watchpoint(watchpoint)) // no enter, no report    if (!remove_watchpoint(watchpoint)) // no enter, no report

Correct.
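
To make the slot assumption concrete, here is a minimal user-space
sketch of Scenario 1. It is not the kernel code: NUM_SLOTS, the two
sentinel values, the address, and replay_scenario_1() are made up for
illustration, and the helpers only mimic the names from the trace.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SLOTS 2             /* hypothetical; Scenario 1 needs at least 2 */
#define INVALID_WATCHPOINT 0L   /* illustrative encoding of a free slot */
#define CONSUMED_WATCHPOINT 1L  /* illustrative encoding of a consumed slot */

static atomic_long slots[NUM_SLOTS];    /* zero-initialized, i.e. all free */

/* Return the slot currently watching addr, or NULL if there is none. */
static atomic_long *find_watchpoint(long addr)
{
    for (size_t i = 0; i < NUM_SLOTS; i++)
        if (atomic_load(&slots[i]) == addr)
            return &slots[i];
    return NULL;
}

/* Claim the first free slot for addr; NULL if every slot is taken. */
static atomic_long *insert_watchpoint(long addr)
{
    for (size_t i = 0; i < NUM_SLOTS; i++) {
        long expected = INVALID_WATCHPOINT;
        if (atomic_compare_exchange_strong(&slots[i], &expected, addr))
            return &slots[i];
    }
    return NULL;
}

/* Free the slot; false means another CPU consumed it (a race was seen). */
static bool remove_watchpoint(atomic_long *wp)
{
    return atomic_exchange(wp, INVALID_WATCHPOINT) != CONSUMED_WATCHPOINT;
}

/* Replay Scenario 1 step by step; returns the number of reports. */
static int replay_scenario_1(void)
{
    const long addr = 0x1000;   /* made-up address, distinct from sentinels */
    int reports = 0;

    /* Both CPUs look before either has inserted, so both find nothing. */
    if (find_watchpoint(addr) || find_watchpoint(addr))
        return -1;              /* would take the "found" path instead */

    /* Both insert; with 2 free slots each CPU gets its own slot. */
    atomic_long *wp0 = insert_watchpoint(addr);
    atomic_long *wp1 = insert_watchpoint(addr);

    /* Nobody consumed either slot, so neither remove triggers a report. */
    if (wp0 && !remove_watchpoint(wp0))
        reports++;
    if (wp1 && !remove_watchpoint(wp1))
        reports++;
    return reports;
}
```

In this model replay_scenario_1() ends with no reports, and with
NUM_SLOTS set to 1 the second insert_watchpoint() returns NULL, which
is why both CPUs can only set up a watchpoint when more than 1 slot is
free.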

> ===
> Scenario 2:
>
> CPU 0: CPU 1:
> tsan_read()
> check_access()
> watchpoint=find_watchpoint() // watchpoint=NULL
> kcsan_setup_watchpoint()
> watchpoint = insert_watchpoint()
>
> tsan_read() tsan_write()
> check_access() check_access()
> find_watchpoint()
> if(expect_write && !is_write)
> continue
> return NULL
> kcsan_setup_watchpoint()
> watchpoint = insert_watchpoint()
> remove_watchpoint(watchpoint)
> watchpoint = INVALID_WATCHPOINT
> watchpoint = find_watchpoint()
> kcsan_found_watchpoint()

This is slightly inaccurate: if the atomic store of INVALID_WATCHPOINT
happens before the concurrent find_watchpoint(), then find_watchpoint()
returns nothing and kcsan_found_watchpoint() is never entered. If
find_watchpoint() runs before the watchpoint is set to
INVALID_WATCHPOINT, the rest of the trace matches.
Either way, no reporting will happen.

> consumed = try_consume_watchpoint() // consumed=false, no report

Correct again, no reporting would happen. While KCSAN is running, check
the 'report_races' counter in /sys/kernel/debug/kcsan; that counter
tells you how often this case actually occurred. In all our testing
with the default config, this case is extremely rare.
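
The orderings discussed for Scenario 2 can be modeled in a few lines of
user-space C. This is a simplified sketch, assuming the owner removes
its watchpoint with an atomic exchange and the racing access consumes
it with a compare-and-swap against the value it saw in
find_watchpoint(). The sentinel values, the address, and the three
reports_when_*() helpers are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define INVALID_WATCHPOINT 0L   /* illustrative encoding of a free slot */
#define CONSUMED_WATCHPOINT 1L  /* illustrative encoding of a consumed slot */
#define WATCHED_ADDR 0x1000L    /* made-up encoded watchpoint */

/* Owner: free the slot; false only if a racing access consumed it. */
static bool remove_watchpoint(atomic_long *wp)
{
    return atomic_exchange(wp, INVALID_WATCHPOINT) != CONSUMED_WATCHPOINT;
}

/* Racing access: claim the slot only if it still holds what find saw. */
static bool try_consume_watchpoint(atomic_long *wp, long seen)
{
    return atomic_compare_exchange_strong(wp, &seen, CONSUMED_WATCHPOINT);
}

/* Ordering 1: the owner removes before the other CPU even looks. */
static bool reports_when_removed_before_find(void)
{
    atomic_long wp;
    atomic_init(&wp, WATCHED_ADDR);

    bool owner_reports = !remove_watchpoint(&wp);
    bool found = atomic_load(&wp) == WATCHED_ADDR;  /* find: nothing there */
    bool consumed = found && try_consume_watchpoint(&wp, WATCHED_ADDR);
    return owner_reports || consumed;
}

/* Ordering 2: find, then the owner removes, then the consume attempt. */
static bool reports_when_removed_before_consume(void)
{
    atomic_long wp;
    atomic_init(&wp, WATCHED_ADDR);

    long seen = atomic_load(&wp);                   /* find: sees the slot */
    bool owner_reports = !remove_watchpoint(&wp);   /* slot not consumed yet */
    bool consumed = try_consume_watchpoint(&wp, seen);  /* fails: slot freed */
    return owner_reports || consumed;
}

/* For contrast: the consume lands before the owner removes the slot. */
static bool reports_when_consumed_before_remove(void)
{
    atomic_long wp;
    atomic_init(&wp, WATCHED_ADDR);

    long seen = atomic_load(&wp);
    bool consumed = try_consume_watchpoint(&wp, seen);  /* succeeds */
    bool owner_reports = !remove_watchpoint(&wp);   /* sees CONSUMED */
    return owner_reports || consumed;
}
```

Only the third ordering yields a report in this model; the first two
are the silent cases from the trace, where the owner has already freed
the slot by the time the other CPU looks or tries to consume.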

As it says on the tin, KCSAN is a *sampling*, watchpoint-based data
race detector, so all of the above is expected. If you want to tweak
KCSAN's config to be more aggressive, there are various options
available. The most important ones:

* KCSAN_UDELAY_{TASK,INTERRUPT} -- Watchpoint delay in microseconds
  for tasks and interrupts respectively. [Increasing this will make
  KCSAN more aggressive.]
* KCSAN_SKIP_WATCH -- Number of memory operations to skip before
  setting up another watchpoint. [Decreasing this will make KCSAN
  more aggressive.]

Note, however, that making KCSAN more aggressive also implies a
noticeable performance hit.
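
To see why these two knobs trade coverage against overhead, here is a
deliberately simplified, deterministic model of the sampling. It is a
sketch, not the kernel implementation: struct sampler, should_watch(),
count_watchpoints() and total_delay_us() are hypothetical names, and
the model ignores any randomization of the skip count or delay.

```c
#include <stdbool.h>

/* Per-CPU sampling state in this toy model. */
struct sampler {
    long skip_watch;    /* accesses to skip between watchpoints */
    long remaining;     /* accesses still to skip before the next one */
};

/* Decide whether the current access should set up a watchpoint. */
static bool should_watch(struct sampler *s)
{
    if (s->remaining > 0) {
        s->remaining--;
        return false;   /* skipped: this access is not checked for races */
    }
    s->remaining = s->skip_watch;
    return true;        /* watched: the access stalls for the delay */
}

/* Number of watchpoints set up over a run of accesses. */
static long count_watchpoints(long skip_watch, long accesses)
{
    struct sampler s = { .skip_watch = skip_watch, .remaining = skip_watch };
    long watched = 0;

    for (long i = 0; i < accesses; i++)
        if (should_watch(&s))
            watched++;
    return watched;
}

/* Total time spent stalled in watchpoints, in microseconds. */
static long total_delay_us(long skip_watch, long accesses, long udelay_us)
{
    return count_watchpoints(skip_watch, accesses) * udelay_us;
}
```

In this model, lowering skip_watch raises the number of watched
accesses, and raising udelay_us lengthens the window in which a
concurrent access can hit each watchpoint; both increase the total
stall time, which is the performance hit mentioned above.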

Also, please find the latest version here:
https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/log/?h=kcsan
-- there have been a number of changes since the initial version from
September/October.

Thanks,
-- Marco