Re: Introduce Sashiko (agentic review of Linux kernel changes)

From: Roman Gushchin

Date: Thu Mar 19 2026 - 18:54:32 EST


"Lorenzo Stoakes (Oracle)" <ljs@xxxxxxxxxx> writes:

> On Wed, Mar 18, 2026 at 11:33:22AM -0700, Roman Gushchin wrote:
>> "Lorenzo Stoakes (Oracle)" <ljs@xxxxxxxxxx> writes:
>>
>> > On Tue, Mar 17, 2026 at 03:31:11PM +0000, Roman Gushchin wrote:
>> >> Hello,
>> >>
>> >> I'm happy to share something my colleagues and I have been working on
>> >> for the last several months:
>> >> Sashiko - an agentic system for Linux kernel changes.
>> >>
>> >> First, Sashiko is available as a service at:
>> >> * https://sashiko.dev
>> >>
>> >
>> > ...
>> >
>> > (For one I'm going to go fix some bugs on my series I saw reported there).
>> >
>> > I think over time as the approach/model is refined this will get a LOT
>> > better, it seems these things can accelerate quickly.
>>
>> Hi Lorenzo,
>>
>> Thank you for the kind words!
>
> No problem, thanks for your hard work! :)
>
>>
>> RE false positives: I think Chris's prompts were initially heavily
>> biased towards avoiding false positives, but that comes at the cost of
>> missing real issues (in general; I don't have hard data on the % of findings).
>> To my knowledge, he's now looking to relax it a bit.
>> But then there are different models in use, different protocols, etc.
>>
>> I also have a notion of issue severity and I was thinking about
>> e.g. sending out only reviews revealing critical & high severity bugs
>> (e.g. memory corruptions & panics). Or maybe send the feedback to the
>> author in any case (e.g. for fixing typos), but cc maintainers only if
>> there are serious concerns.
>>
>> And obviously no pressure, I won't enable any public email sending
>> unless there is a consensus across maintainers of the corresponding
>> subsystem.
>
> I think maybe an opt-in thing might work for some of us?

Absolutely, I think with mm we can start with replying to the author and
a dedicated list of volunteers.

> But yeah we can take our time with this, Andrew is looking, I am for
> sure.

Thank you!

>
> Oh and one data point -
> https://lore.kernel.org/linux-mm/cover.1773846935.git.ljs@xxxxxxxxxx/
>
> Read the v3 change log for a list of the issues it correctly raised for that
> series, so it's definitely useful.
>
> It was about maybe 50/50 noise/signal I think?
>
> But as you can see that's already very useful, thank you, and it has fixed a
> bunch of bugs in that code!
>
> I'm not sure what Chris is planning, and I keep missing the AI
> meetings for various reasons (other stuff clashing/being away/tired sometimes :)
> but I wonder how we will sync up with Chris's review-bot experiments?

So, as Chris said, we're syncing regularly and actively thinking about how
to organize it. I think we both want to share as much as possible.

The hard part is that we can't easily test each other's setups, and it's
all very brittle. Initially I tried to use Chris's prompts directly with
only minimal changes, but it was hard to keep Sashiko stable. Plus the
new multi-stage protocol improved the discovery rate by almost 10%,
which was hard to ignore.

My current thinking (things are evolving quickly, so I might have a
different opinion in a couple of weeks) is that we need to separate out
the per-subsystem knowledge, make sure it doesn't contain any imperative
instructions or LLM/tool specifics, and share it completely. We could move
it to a separate repo or even put it into the kernel tree; it's all
debatable. In a way, these prompts should be owned by subsystem
maintainers more than anyone else.
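To make that separation concrete, here's a toy sketch (all names and facts
below are hypothetical illustrations, not Sashiko's actual format): the
per-subsystem knowledge is pure declarative facts, and each tool combines
it with its own private, imperative preamble:

```python
# Hypothetical sketch: per-subsystem knowledge kept as plain declarative
# facts, with no imperative instructions or LLM/tool specifics in it.
MM_KNOWLEDGE = {
    "subsystem": "mm",
    "facts": [
        "Pages from alloc_pages() must be released with __free_pages().",
        "Walking another process's VMAs requires holding mmap_lock.",
    ],
}

def build_review_context(knowledge, tool_preamble):
    """Combine shared factual knowledge with a tool-specific preamble.

    The preamble (imperative instructions, output format, model quirks)
    stays private to each tool; only the facts are shared across tools.
    """
    facts = "\n".join(f"- {f}" for f in knowledge["facts"])
    return f"{tool_preamble}\n\nKnown {knowledge['subsystem']} invariants:\n{facts}"
```

That way any tool or model can consume the same facts without inheriting
another tool's protocol.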

Then there are things which can be shared but are not subsystem-specific,
e.g. instructions on how to assess issue severity.
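For instance, a shareable severity rubric could be as simple as a mapping
plus a routing rule (the category names and routing here are entirely
hypothetical, just to illustrate CC'ing maintainers only on serious
findings):

```python
# Hypothetical severity rubric shared across tools: category -> severity,
# plus a routing rule that CCs maintainers only on serious findings.
SEVERITY = {
    "memory-corruption": "critical",
    "panic": "critical",
    "deadlock": "high",
    "memory-leak": "medium",
    "typo": "low",
}

def recipients(finding_categories):
    """Reply only to the author unless something serious was found."""
    severities = [SEVERITY.get(c, "low") for c in finding_categories]
    if any(s in ("critical", "high") for s in severities):
        return ["author", "maintainers"]
    return ["author"]
```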

And then there is the specific review protocol, which significantly
depends on the tooling and LLM being used. This part is hard to share,
but it's also where a lot of experimentation is happening, so maybe it's
fine to have multiple tools. And they might be optimized for different
use cases: e.g. for personal development it might be beneficial to have
a live interaction with the LLM on the review material (someone already
asked me about this); but for sashiko.dev's mass-review case I care a
lot about stability and token efficiency.
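As a purely illustrative sketch of why a multi-stage protocol helps with
token efficiency (the function names are made up, not Sashiko's
internals): a cheap triage pass gates the expensive deep-review pass:

```python
# Illustrative two-stage review protocol: a cheap triage pass decides
# whether spending tokens on the expensive deep-review pass is worth it.
def review_patch(patch, cheap_triage, deep_review, threshold=0.5):
    """cheap_triage(patch) -> suspicion score in [0, 1];
    deep_review(patch)  -> list of findings (the expensive stage)."""
    if cheap_triage(patch) < threshold:
        return []  # looks clean: skip the deep review, save tokens
    return deep_review(patch)
```

The threshold then becomes a knob trading missed issues against token
cost, which is the same tension as the false-positive discussion above.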

>> >>
>> >> * What's next?
>> >>
>> >> This is our first version and it's obviously not perfect. There is a
>> >> long list of fixes and improvements to make. Please don't expect it to
>> >> be 100% reliable, even though we'll try hard to keep it up and running.
>> >> Please use GitHub issues or email to send me bug reports and feature
>> >> requests, or send PRs.
>> >
>> > Of course, it's all much appreciated!
>> >
>> >>
>> >> As of now, Sashiko only provides a web interface;
>> >> however, Konstantin Ryabitsev is already adding sashiko.dev support to b4,
>> >> and SeongJae Park is adding support to hkml.
>> >> That was really fast, thank you!
>> >
>> > Thanks to Konstantin and SJ too, but the web interface is pretty nice I
>> > must say, so thanks for that! :)
>> >
>> >>
>> >> We're working on adding an email interface to Sashiko, and soon Sashiko
>> >> will be able to send out reviews over email - similar to what the bpf
>> >> subsystem already has. It will be opt-in by subsystem and will have options
>> >
>> > Like I said, I think it's a bit premature for mm at least _at this point_
>> > but I'm sure it'll get there.
>>
>> I'd really appreciate (and actually need) feedback from you and other
>> maintainers and developers here. Even though I can't fix every single false
>> positive as a code issue, I can hopefully tackle some common themes.
>
> Is there a way for us to point out which parts of a review are signal and
> which are noise?

Not yet. I think answering emails is the easiest part, and I plan to
teach Sashiko to recognize these answers and analyze them. Maybe Sashiko
can even adjust its own prompts in a (semi-)automatic way, I don't know.
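A toy version of that reply analysis might look like this (the keyword
matching is just a stand-in for a model call, and the marker lists are
invented):

```python
# Toy sketch: classify a maintainer's reply to an AI review finding as
# confirming (signal) or rejecting (noise). A real system would use a
# model call here; keyword matching is only a placeholder.
SIGNAL_MARKERS = ("good catch", "will fix", "applied", "you're right")
NOISE_MARKERS = ("false positive", "not a bug", "intentional", "wrong")

def classify_reply(reply_text):
    text = reply_text.lower()
    if any(m in text for m in NOISE_MARKERS):
        return "noise"
    if any(m in text for m in SIGNAL_MARKERS):
        return "signal"
    return "unknown"
```

Aggregated over many replies, even a rough classifier like this could
point at the prompts that generate the most noise.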

>
> If you could update the web interface for feedback that'd be really handy,
> though I guess there's the painful stuff of having to have users and
> etc. for that :)

Yeah, I'm afraid we might end up trying to build a new JIRA this way...

>
>>
>> Chris did fantastic work on the bpf subsystem (and several others) by
>> manually analyzing replies to the AI feedback and adjusting prompts. Now
>> we need to repeat this for all the other subsystems.
>
> Yeah, I'm happy to give feedback if there's a fairly low-friction way of
> doing it, but the constant workload makes it hard if it requires much more
> effort :)

Can't agree more :)

>
>>
>> >
>> > For now I think we need to get the false positive rate down a fair bit,
>> > otherwise it might be a little distracting.
>> >
>> > But people are _already_ integrating the web interface into workflows, I
>> > check it now, and Andrew is already very keen :) see:
>> >
>> > https://lore.kernel.org/all/20260317121736.f73a828de2a989d1a07efea1@xxxxxxxxxxxxxxxxxxxx/
>> > https://lore.kernel.org/all/20260317113730.45d5cef4ba84be4df631677f@xxxxxxxxxxxxxxxxxxxx/
>> >
>> >> to CC only the author of the patch, maintainers, volunteers, or send a
>> >> fully public reply. If you're a maintainer and have a strong preference
>> >> to get reviews over email, please let me know.
>> >
>> > Well as maintainer I think 'not quite yet' but probably soon is the answer
>> > on that one!
>> >
>> >>
>> >> We also desperately need better benchmarks, especially when it comes to
>> >> false positives. Having a decent vetted set of officially perfect
>> >> commits can help with this.
>> >
>> > Not sure perfect commits exist in the kernel, certainly not mine :P
>>
>> Same here :) This is why it's so hard.
>
> Yes, but worthwhile! LLMs are surprisingly good at figuring out issues in
> things, it's a real strength.
>
> And it's already improving the code.
>
>>
>> >
>> >>
>> >> Finally, some subsystems have good prompt coverage and some don't. It
>> >> doesn't have to be lengthy documentation (that might actually be
>> >> counter-productive), but having a small list of things to look at - some
>> >> high-level concepts which are hard to grasp from the code, etc. - can
>> >> help a lot with both bug discovery and false positives.
>> >
>> > I guess best contributed to Chris's review-prompts repo right?
>>
>> Both work for me for now; we'll figure out with Chris how to sync our
>> prompts. The small problem is that we're using different models, tools,
>> and review protocols and can barely test each other's setups. And it's
>> all very fragile, so it's not exactly trivial.
>> But we'll figure something out soon.
>
> Yeah, part of the fun I guess :)
>
>>
>> In general we need to carefully separate instructions (like which tools
>> to use, which prompts to load, etc.) from factual data. Then we can easily
>> use the factual data with various tooling.
>
> Hopefully I find some time to contribute some mm-specific stuff too :)

Awesome, looking forward to it!

Thanks!