Re: [tip:sched/core] [sched/fair] 79104becf4: BUG:kernel_NULL_pointer_dereference,address

From: Philip Li

Date: Fri Nov 07 2025 - 05:16:16 EST


On Wed, Nov 05, 2025 at 08:06:32PM +0800, Philip Li wrote:
> On Wed, Nov 05, 2025 at 12:00:26PM +0100, Peter Zijlstra wrote:
> > On Tue, Oct 28, 2025 at 10:30:08AM +0800, Chen, Yu C wrote:
> > > On 10/27/2025 10:09 PM, Peter Zijlstra wrote:
> > > > On Mon, Oct 27, 2025 at 03:07:18PM +0100, Peter Zijlstra wrote:
> > > > > On Mon, Oct 27, 2025 at 02:55:16PM +0100, Peter Zijlstra wrote:
> > > > >
> > > > > > > May I know if you are using the kernel config 0day attached?
> > > > > > > I found that the config 0day attached
> > > > > > > (https://download.01.org/0day-ci/archive/20251021/202510211205.1e0f5223-lkp@xxxxxxxxx/config-6.18.0-rc1-00001-g79104becf42b)
> > > > > > > has
> > > > > > > CONFIG_IA32_EMULATION=y
> > > > > > > CONFIG_IA32_EMULATION_DEFAULT_DISABLED=y
> > > > >
> > > > > Yep, deleting that entry makes it all work.
> > > >
> > > > 'work' might be over stating, it boots and starts trinity, which then
> > > > promptly (as in a handful of seconds) triggers OOM and dies. Not
> > > > actually reproducing the NULL deref I was looking for.
> > >
> > > Change the following line in job-script
> > > export memory='16G'
> > > to
> > > export memory='64G'
> > > ?
> >
> > Yes, that seems to help.
> >
> > > I had a try and can reproduce the NULL except at first run:
> >
> > Took me two runs, but yes, I can see it now.
> >
> > Anyway, this is two bugs in the robot, can we please fix all this to not
> > happen again?
>
> Got it, I will dig into the detail to understand the difference of local
> reproduce and internal cluster run. The image, kconfig, and memory
> are exactly the same for actual robot run and provided reproduce instruction,
> since the attachment is reproduced from the job execution. I didn't find the
> cause quickly, and i will be back to this asap and provide update.
>
> >
> > - .config has 32bit disabled while robot provides 32bit images. Clearly
> > the actual robot runs 64bit images and the reproduction should
> > provide those too.

Update: this one is resolved. The cluster run sets ia32_emulation=on on the
kernel cmdline, which was missing from the reproduce steps.
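
For illustration only, here is a minimal sketch of why the two runs behaved differently. It assumes that ia32_emulation=<bool> on the kernel cmdline overrides CONFIG_IA32_EMULATION_DEFAULT_DISABLED at boot, and that the last occurrence wins. The helper names and the example cmdline strings are hypothetical and are not taken from the robot's job files; the boolean parsing only approximates what the kernel accepts.

```python
def parse_bool(value):
    """Approximate kernel-style boolean parsing (y/t/1/on vs n/f/0/off)."""
    v = value.lower()
    if v[:1] in ("y", "t", "1") or v[:2] == "on":
        return True
    if v[:1] in ("n", "f", "0") or v[:2] == "of":
        return False
    return None  # unrecognized value: leave the default untouched


def ia32_emulation_enabled(kconfig_text, cmdline):
    """Return True if 32-bit binaries would be allowed to run at boot."""
    options = set(line.strip() for line in kconfig_text.splitlines())
    if "CONFIG_IA32_EMULATION=y" not in options:
        return False  # emulation support is not built in at all
    # Built in, but possibly disabled by default.
    enabled = "CONFIG_IA32_EMULATION_DEFAULT_DISABLED=y" not in options
    # Assumption: a later ia32_emulation= argument overrides an earlier one.
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        if key == "ia32_emulation" and sep:
            parsed = parse_bool(value)
            if parsed is not None:
                enabled = parsed
    return enabled


# The two options quoted from the attached config.
kconfig = "CONFIG_IA32_EMULATION=y\nCONFIG_IA32_EMULATION_DEFAULT_DISABLED=y\n"

# Hypothetical cmdlines: the first lacks the override, the second has it.
print(ia32_emulation_enabled(kconfig, "root=/dev/ram0"))                    # False
print(ia32_emulation_enabled(kconfig, "root=/dev/ram0 ia32_emulation=on"))  # True
```

With the attached config and no override, 32-bit userspace is refused, which matches the failure seen with the 32bit image in local reproduction.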

> >
> > - job description is inaccurate in the amount of memory required.

Got it. The cluster run with 16G has a 40% rate (over about 20 runs). I have
now increased the memory to 32G to reduce the chance of OOM in local
reproduction.

> >
> > The reproduction steps must exactly match what the real robot runs, not
> > something else.

Sorry for the wrong reproduce steps; we should be more careful to keep them
consistent with the actual robot run.

And thanks again to Peter and Yu.

> >