Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf() if no eligible entity is found

From: Chen Yu
Date: Mon Apr 15 2024 - 04:03:34 EST


On 2024-04-15 at 09:22:51 +0200, Peter Zijlstra wrote:
> On Tue, Apr 09, 2024 at 11:21:04AM +0200, Peter Zijlstra wrote:
>
> > Is there any sane way to reproduce this, and how often does it happen?
>
> This, how do you all make it go bang?

It was reproduced in lkp's environment, and originally reported here:
https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@xxxxxxxxx/

It is a trinity test on a vm guest, and seems like be triggered after some futex
test. And it was reproduced at a rate of 23/999 according to that report.
Previously I could not reproduce it locally, so lkp has helped test my debug
patch in their environment and got the clue that it was broken by s64 overflow.

Breno told me that he has reproduce this issue with KASAN on and using:
'stress-ng -a 20', but I can not reproduce it locally neither.

I'm thinking of creating a debug patch to trace all the changes related to
cfs_rq->avg_vruntime in avg_vruntime_add()/sub().
To see how cfs_rq->avg_vruntime gets far behind the cfs_rq->min_vruntime, which
caused the overflow. My understanding is that, the se's vruntime is got from
cfs_rq->avg_vruntime in place_entity(), if the se's vruntime gets an extrem
smaller value than cfs_rq->min_vruntime, then it might indicate that there could
be something wrong with the update of cfs_rq->avg_vruntime.
Then lkp could help us to further debug.

Or do you have any suggestion/suspect that how to narrow down this, I could
try as you suggest.

thanks,
Chenyu