Re: strange interaction between fuse + pidns

From: Eric W. Biederman
Date: Mon Jul 11 2022 - 19:06:53 EST


Tycho Andersen <tycho@tycho.pizza> writes:

> On Mon, Jul 11, 2022 at 04:37:12PM -0500, Eric W. Biederman wrote:
>> Tycho Andersen <tycho@tycho.pizza> writes:
>>
>> > Hi all,
>> >
>> > On Mon, Jul 11, 2022 at 03:59:15PM +0200, Miklos Szeredi wrote:
>> >> On Mon, 11 Jul 2022 at 12:35, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
>> >> >
>> >> > Can you try the attached untested patch?
>> >>
>> >> Updated patch to avoid use after free on req->args.
>> >>
>> >> Still mostly untested.
>> >
>> > Thanks, when I applied your patch, I still ended up with tasks stuck
>> > waiting with a SIGKILL pending. So I looked into that and came up with
>> > the patch below. With both your patch and mine, my testcase exits
>> > cleanly.
>> >
>> > Eric (or Christian, or anyone), can you comment on the patch below? I
>> > have no idea what this will break. Maybe instead a better approach is
>> > some additional special case in __send_signal_locked()?
>> >
>> > Tycho
>> >
>> > From b7ea26adcf3546be5745063cc86658acb5ed37e9 Mon Sep 17 00:00:00 2001
>> > From: Tycho Andersen <tycho@tycho.pizza>
>> > Date: Mon, 11 Jul 2022 11:26:58 -0600
>> > Subject: [PATCH] sched: __fatal_signal_pending() should also check shared
>> > signals
>> >
>> > The wait_* code uses signal_pending_state() to test whether a thread has
>> > been interrupted, which ultimately uses __fatal_signal_pending() to detect
>> > if there is a fatal signal.
>> >
>> > When a pid ns dies, in zap_pid_ns_processes() it does:
>> >
>> > group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
>> >
>> > for all the tasks in the pid ns. That calls through:
>> >
>> > group_send_sig_info() ->
>> > do_send_sig_info() ->
>> > send_signal_locked() ->
>> > __send_signal_locked()
>> >
>> > which does:
>> >
>> > pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
>> >
>> > which puts sigkill in the set of shared signals, but not the individual
>> > pending ones. If tasks are stuck in a killable wait (e.g. a fuse flush
>> > operation), they won't see this shared signal, and will hang forever, since
>> > TIF_SIGPENDING is set, but the fatal signal can't be detected.
>>
>> Hmm.
>>
>> That is perplexing.
>
> Thanks for taking a look.
>
>> __send_signal_locked calls complete_signal. Then if any of the tasks of
>> the process can receive the signal, complete_signal will loop through
>> all of the tasks of the process and set the per thread SIGKILL. Pretty
>> much by definition tasks can always receive SIGKILL.
>>
>> Is complete_signal not being able to do that?
>
> In my specific case it was because my testcase was already trying to
> exit and had set PF_EXITING when the signal is delivered, so
> complete_signal() was indeed not able to do that since PF_EXITING is
> checked before SIGKILL in wants_signal().
>
> But I changed my testacase to sleep instead of exit, and I get the
> same hang behavior, even though complete_signal() does add SIGKILL to
> the set. So there's something else going on there...
>
>> The patch below really should not be necessary, and I have pending work
>> that if I can push over the finish line won't even make sense.
>>
>> As it is currently an abuse to use the per thread SIGKILL to indicate
>> that a fatal signal has been short circuit delivered. That abuse as
>> well as being unclean tends to confuse people reading the code.
>
> How close is your work? I'm wondering if it's worth investigating the
> non-PF_EXITING case further, or if we should just land this since it
> fixes the PF_EXITING case as well. Or maybe just do something like
> this in addition:

It is not different enough to change the semantics. What I am aiming
for is having a dedicated flag indicating a task will exit, that
fatal_signal_pending can check. And I intend to make that flag one way
so that once it is set it will never be cleared.

Which sort of argues against your patch below.

It most definitely does not sort out the sleep case.


The other thing I have played with that might be relevant was removing
the explicit wait in zap_pid_ns_processes and simply not allowing wait
to reap the pid namespace init until all it's children had been reaped.
Essentially how we deal with the thread group leader for ordinary
processes. Does that sound like it might help in the fuse case?

I am not certain how far such a change would be (it has been a couple
years since I played with implementing it) but it can be made much
sooner if it demonstratively breaks some dead-locks, and generally makes
the kernel easier to maintain.

Eric