We could do that with the whole "task_work" thing (or perhaps just
do_notify_resume(), especially after merging the "don't necessarily
return with iret" patch I sent out earlier), with additionally making
sure that scheduling does the right thing wrt a "currently dirty math
state due to kernel use".

The advantage of that would be that we really could do a *lot* of FP
math very cheaply in the kernel, because we'd pay the overhead of
kernel_fpu_begin/end() just once (well, the "end" part would be just
setting the bit that we now have dirty state, the cost would be in the
return-to-user-space-and-restore-fp-state part).

Comments? That would be much more invasive than just changing
__kernel_fpu_end(), but would bring in possibly quite noticeable
advantages under loads that use the FP/vector resources in the kernel.

This would be very good, and it needs to work in interrupt context
(softirq) as well, including when we interrupt the idle task. With
networking we can really hit kernel_fpu_begin()/end() millions of
times per second, and there's really only a need to do it once per
interrupt. This is actually similar to what I was doing (in
do_softirq()) when I noticed eagerfpu was broken, and now Nate's bug
AFAICS happens there as well.
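
To make the cost argument concrete, here is a rough user-space sketch of
the "pay once, restore on return to user space" idea. It is only a toy
model under stated assumptions, not kernel code: struct toy_task, the
toy_* functions and the fpu_dirty flag are all hypothetical names, the
counters merely stand in for the expensive save/restore, and scheduling,
softirq and idle-task handling are not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, not kernel code: counters stand in for the
 * expensive FP state save/restore operations. */
struct toy_task {
	bool in_kernel_fpu;     /* currently inside a begin/end section */
	bool fpu_dirty;         /* registers hold kernel state; user state
	                         * must be restored before user return */
	unsigned long saves;    /* expensive user-state saves performed */
	unsigned long restores; /* expensive user-state restores performed */
};

/* Save user FP state only if it is still live in the registers. */
static void toy_kernel_fpu_begin(struct toy_task *t)
{
	assert(!t->in_kernel_fpu);
	if (!t->fpu_dirty)
		t->saves++;
	t->in_kernel_fpu = true;
}

/* Cheap "end": no restore here, just record that the state is dirty. */
static void toy_kernel_fpu_end(struct toy_task *t)
{
	assert(t->in_kernel_fpu);
	t->in_kernel_fpu = false;
	t->fpu_dirty = true;
}

/* Deferred cost: restore once on the way back to user space. */
static void toy_return_to_user(struct toy_task *t)
{
	if (t->fpu_dirty) {
		t->restores++;
		t->fpu_dirty = false;
	}
}

/* One kernel entry that uses the FPU `sections` times before exiting. */
static void toy_kernel_entry(struct toy_task *t, unsigned long sections)
{
	for (unsigned long i = 0; i < sections; i++) {
		toy_kernel_fpu_begin(t);
		/* ... FP/vector work would happen here ... */
		toy_kernel_fpu_end(t);
	}
	toy_return_to_user(t);
}
```

In this model, calling toy_kernel_entry(&t, 1000) on a zero-initialized
toy_task ends with t.saves == 1 and t.restores == 1, no matter how many
begin/end pairs ran in between, which is the whole point of deferring
the restore.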