Re: [RFC] BPF fault/jitter-injection framework

From: Alexei Starovoitov
Date: Fri May 02 2025 - 22:11:17 EST


On Thu, May 1, 2025 at 9:10 PM Sergey Senozhatsky
<senozhatsky@xxxxxxxxxxxx> wrote:
>
> Greetings,
>
> I've been thinking what if we had a BPF jitter/fault injection framework
> for more fine-grained and configurable kernel testing. Current fault
> injection doesn't support function arguments analysis, with BPF we
> can have something like
>
> // of course bpf_schedule_timeout() doesn't exist yet
> call bpf_schedule_timeout(120) in blk_execute_rq(rq) if
> rq->q->disk->major == 8 && rq->q->disk->first_minor == 0
>
> So that would introduce blk request execution timeouts/jitters for a
> particular gendisk only. And so on.
>
> Has this been discussed before? Does this approach even make sense
> or is there a better (another) way to do this?

I think it makes sense.
That was the motivation for us to do:

$ git grep ALLOW_ERROR_INJECTION fs/
fs/btrfs/ctree.c:ALLOW_ERROR_INJECTION(btrfs_cow_block, ERRNO);
fs/btrfs/ctree.c:ALLOW_ERROR_INJECTION(btrfs_search_slot, ERRNO);
fs/btrfs/disk-io.c:ALLOW_ERROR_INJECTION(open_ctree, ERRNO);
fs/btrfs/free-space-cache.c:ALLOW_ERROR_INJECTION(io_ctl_init, ERRNO);
fs/btrfs/relocation.c:ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
fs/btrfs/tree-checker.c:ALLOW_ERROR_INJECTION(btrfs_check_leaf, ERRNO);
fs/btrfs/tree-checker.c:ALLOW_ERROR_INJECTION(btrfs_check_node, ERRNO);

The one in open_ctree() actually found a few bugs.
It's a success story.

Targeted error injection works better than random fuzzing.

To call schedule_timeout(), the bpf program needs to be sleepable.
The majority of LSM and ALLOW_ERROR_INJECTION hooks are sleepable.
All syscalls are sleepable too.
So most of the infrastructure is already available.

Add a bpf_schedule_timeout() kfunc and ALLOW_ERROR_INJECTION marks
where it matters and it's good to go.
The kfunc and the error injection marks are non-binding.
We can remove them if this experiment doesn't work out.
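
To make the shape of the quoted example concrete, here is a rough
userspace model of the argument filter. It is only a sketch under
assumptions: the structs are cut-down stand-ins for the kernel ones
(only the fields the filter reads), mock_bpf_schedule_timeout() stands
in for the proposed kfunc, which does not exist yet, and rq_matches()
and on_blk_execute_rq() are hypothetical names. A real version would be
a sleepable bpf program attached to the function of interest.

```c
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures (assumption: only the
 * fields read by the filter are modelled). */
struct gendisk { int major; int first_minor; };
struct request_queue { struct gendisk *disk; };
struct request { struct request_queue *q; };

/* Stand-in for the proposed bpf_schedule_timeout() kfunc. It does not
 * sleep; it accumulates the requested timeout so the effect is visible. */
static long injected_timeout;

static void mock_bpf_schedule_timeout(long timeout)
{
	injected_timeout += timeout;
}

/* Argument filter from the quoted example: match one gendisk only.
 * Pointers are checked before they are dereferenced. */
static bool rq_matches(const struct request *rq, int major, int first_minor)
{
	if (!rq || !rq->q || !rq->q->disk)
		return false;
	return rq->q->disk->major == major &&
	       rq->q->disk->first_minor == first_minor;
}

/* Hypothetical body of a program run on entry to blk_execute_rq(). */
static void on_blk_execute_rq(const struct request *rq)
{
	if (rq_matches(rq, 8, 0))
		mock_bpf_schedule_timeout(120);
}
```

Requests for any other disk pass through untouched, so the jitter is
confined to the one gendisk under test.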