Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Machine Learning (ML) library in Linux kernel

From: Chris Mason

Date: Tue Feb 10 2026 - 09:22:37 EST


On 2/10/26 8:47 AM, Jan Kara wrote:
> On Mon 09-02-26 22:28:59, Viacheslav Dubeyko via Lsf-pc wrote:
>> On Mon, 2026-02-09 at 02:03 -0800, Chris Li wrote:
>>> On Fri, Feb 6, 2026 at 11:38 AM Viacheslav Dubeyko
>>> <Slava.Dubeyko@xxxxxxx> wrote:
>>>>
>>>> Hello,
>>>>
>>>> Machine Learning (ML) is an approach to learning from data, finding
>>>> patterns, and making predictions without developers implementing the
>>>> algorithms explicitly. The number of areas where ML is applied grows
>>>> every day. Generally speaking, ML could introduce self-evolving and
>>>> self-learning capabilities into the Linux kernel. There are already
>>>> research works and industry efforts to employ ML approaches for
>>>> configuring and optimizing the Linux kernel. However, introducing ML
>>>> approaches into the Linux kernel is neither simple nor straightforward;
>>>> there are multiple problems and unanswered questions on this road.
>>>> First of all, any ML model requires floating-point operations to run,
>>>> but there is no direct use of the FPU in kernel space. An ML model
>>>> also requires a training phase, which can cause significant performance
>>>> degradation in the kernel; even the inference phase could be
>>>> problematic from a performance point of view on the kernel side.
>>>> Using ML approaches in the Linux kernel is an inevitable step.
>>>> But how can we use ML approaches in the Linux kernel? Which
>>>> infrastructure do we need to adopt ML models in the Linux kernel?
>>>
>>> I think there are two different things here; I suspect you want the
>>> latter, but I am not sure:
>>>
>>> 1) Using an ML model to help kernel development: code reviews,
>>> generating patches from descriptions, etc. For example, Chris Mason
>>> has a kernel review repo on GitHub and he is sharing his review
>>> findings on the mailing list:
>>> https://github.com/masoncl/review-prompts/tree/main
>>> It is kernel-development related, but the ML agent code runs in
>>> user space, and the actual ML computation might run on GPUs/TPUs.
>>> That does not seem to be what you have in mind.
>>>
>>> 2) Running the ML model computation in kernel space.
>>> Can you clarify whether this is what you have in mind? You mention FPU
>>> usage in the kernel for the ML model; that is only relevant if you
>>> need to run floating-point CPU instructions in kernel mode. Most ML
>>> computations do not run as CPU instructions at all; they run on
>>> GPUs/TPUs. Why not keep the ML program (PyTorch/agents) in user space
>>> and pass the data to the GPU/TPU driver to run? There will be some
>>> kernel infrastructure like VFIO/IOMMU involved with the GPU/TPU
>>> driver, but for the most part the kernel is just facilitating the
>>> data passing to/from the GPU/TPU driver and then to the GPU/TPU
>>> hardware. The ML hardware is doing the heavy lifting.
>>
>> The idea is to have the ML model running in user space while a kernel
>> subsystem interacts with it. As the next step, I am considering two
>> real-life use cases: (1) the GC subsystem of an LFS file system, and
>> (2) an ML-based DAMON approach. So, for example, GC can be represented
>> by an ML model in user space: the GC requests data (segment state)
>> from kernel space, and the user-space ML model does training and/or
>> inference. As a result, the model selects victim segments and
>> instructs the kernel-space logic to move valid data from the victim
>> segment(s) into clean/current one(s).
>
> To be honest, I'm skeptical about how generic this can be. Essentially
> you're describing a generic interface for offloading arbitrary kernel
> decisions to userspace. ML is a userspace business here and not really
> relevant to the concept AFAICT. And we already have several ways for
> the kernel to ask userspace to do something for it, and unless the
> mechanism is very restricted and well defined it is rather painful and
> prone to deadlocks, security issues, etc.
>
> So by all means, if you want to make GC decisions for your filesystem
> in userspace via ML, be my guest; it does make some sense, although I'd
> be wary of situations where we need to write back dirty pages to free
> memory, which may now depend on your userspace helper making a decision
> that may itself need memory to make... But I don't see why you need all
> the ML fluff around it when this seems like just another way to call a
> userspace helper, and why some of the existing methods would not
> suffice.

Looking through the description (not the code, apologies), it really
feels like we're reinventing BPF here:

- introspection into what the kernel is currently doing
- communications channel with applications
- a mechanism to override specific kernel functionality
- fancy applications arbitrating decisions.

My feedback during Plumbers, and also today, is that you can get 99% of
what you're looking for with some BPF code.

It may or may not be perfect for your needs, but it's a much faster path
to building community and collaboration around these goals. After that,
it's a lot easier to justify larger changes in the kernel.

If this becomes an LSF/MM topic, my bar for discussion would be:
- extensive data collected about some kernel component (DAMON,
scheduling, etc.)
- working proof of concept that improved on decisions made in the kernel
- discussion of changes needed to improve or enable the proof of concept

In other words, I don't think we need a list of ways ML might be used.
I think we need specific examples of a way that ML was used and why it's
better than what the kernel is already doing.

-chris