Re: [openib-general] Re: [PATCH 35 of 53] ipath - some interrelatedstability and cleanliness fixes
From: Dave Olson
Date: Thu May 18 2006 - 00:13:28 EST
On Wed, 17 May 2006, Roland Dreier wrote:
| Dave> We are seeing a bug (with both our driver native MPI
| Dave> processes and mthca mvapic), where when 8 processes using
| Dave> "simultaneously exit", we get watchdogs and/or hangs in the
| Dave> close routines. Moving the freeing outside the mutex was an
| Dave> attempt to see if we were running into some VM issues by
| Dave> doing lots of page unlocking and freeing with the mutex
| Dave> held. It seemed to help somewhat, but not to solve the
| Dave> problem.
| Am I understanding correctly that you see a hang or watchdog timeout
| even with the mthca driver?
Yes. That is, the symptoms are the same, although the cause
may be different.
| Is there any possibility of posting the test case to reproduce this?
It's the MPI job mpi_multibw (based on the OSU osu_bw, but changed
to do messaging rate), running 8 copies per dual-core 4-socket opteron,
both on InfiniPath MPI, and MVAPICH (built for gen2).
We ship the source with our upcoming release, and will probably make
it available outside our release.
We did discover one possible problem today, which is shared between
our device code and the core openib code, and that's doing some
memory freeing and accounting from a work thread (updating mm->locked_vm
and cleaning up from earlier get_user_pages); the code in our driver
was copied from the openib core code, it's not literally shared.
I have a strong suspicion that at least sometimes, it's executing after
the current->mm has gone away. I'm looking at that more right now.
| It doesn't seem likely that ipath changes are going to fix a generic
| bug like this...
It wasn't an attempt to fix it, so much as to work around it, while
I worked on other higher priority stuff. As I mentioned, it also helps
a bit in allowing multiple processes to be in the open and close code
simultaneously, when you have multiple cpus, so even on that basis,
I'd probably leave it as it now is.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/