SOS!!! Message Passing Error!

From: HE Hao (haohe@me1.eng.wayne.edu)
Date: Thu May 11 2000 - 22:05:05 EST


Dear friends,

I need your help!

I am a graduate student doing parallel scientific computing on a group of PCs
running Linux and the MPI message-passing library. I have run into a problem with
my computation and have no idea how to solve it.

My program solves a problem over many iterations. In every iteration, each compute node
sends messages to and receives messages from its neighbours.
When the problem is small, meaning less memory is allocated and less data is exchanged
in every iteration, everything seems OK. But when I allocate a lot of memory to solve a large problem,
I hit a fatal error (error messages attached below).
Strangely, the error always happens after a number of successful iterations,
and there is no way to tell when it will happen.
The error seems to occur while exchanging messages: "Connection timed out".

I don't know whether it is due to Linux or to the network hardware.
In this case, the exchanged data are very large, about 2.5M bytes between any pair of nodes
in each iteration. Considering that each iteration takes only several seconds, this is a heavy load
on the network.
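
A rough check of that load can be sketched as follows. It assumes "2.5M bytes"
means 2.5 x 10^6 bytes and that the 100M Ethernet link delivers its nominal
100 Mbit/s with no protocol overhead. The function name and the neighbour count
are illustrative assumptions, not values taken from the program.

```python
def min_transfer_seconds(message_bytes, link_mbit_per_s):
    """Lower bound on the time to push one message through one link."""
    bits = message_bytes * 8
    return bits / (link_mbit_per_s * 1_000_000)

MESSAGE_BYTES = 2_500_000   # assumed meaning of "2.5M bytes"
LINK_MBIT = 100             # nominal rate of a 100M Ethernet card

per_message = min_transfer_seconds(MESSAGE_BYTES, LINK_MBIT)
print(f"one message: at least {per_message:.2f} s")

# Hypothetical: a node that sends to 4 neighbours through its single card
neighbours = 4
total = neighbours * per_message
print(f"{neighbours} neighbours: at least {total:.2f} s of sending per iteration")
```

Under these assumptions a single 2.5M-byte message needs at least 0.2 s on the
wire, so several neighbours per node can take up a noticeable share of an
iteration that lasts only several seconds.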

My system:
Eight Pentium III PCs with 3Com 3C905 100 Mbit/s Ethernet cards, an Intel Express 510T switch,
SuSE 6.1 (kernel 2.2.7), MPICH 1.2

Since I am not a Linux newbie, I am looking for help here.
Thank you for your patience; any advice or suggestions will be highly appreciated.
Thank you very much!

Here are the error messages, displayed after about 5 hours of successful work.
After that the computation stopped and I could not get any results.

......
 IT= 5000
   0.004311 0.008730 0.008730 0.011694 0.400392 0.003235
 My Rank= 3
p3_25675: (19533.719481) net_send: could not write to fd=6, errno = 110
    p4_error: latest msg from perror: Connection timed out
 My Rank= 2
net_recv failed for fd = 9
p2_25535: p4_error: net_recv read, errno = : 104
rm_l_2_25541: p4_error: interrupt SIGINT: 2
p3_25675: p4_error: net_send write: -1
rm_l_3_25681: p4_error: interrupt SIGINT: 2
rm_l_1_25504: p4_error: net_recv read: probable EOF on socket: 1
p0_28654: p4_error: interrupt SIGINT: 2
 My Rank= 6
 My Rank= 5
p6_25028: p4_error: interrupt SIGINT: 2
rm_l_6_25034: p4_error: interrupt SIGINT: 2
rm_l_5_25034: p4_error: interrupt SIGINT: 2
p5_25028: p4_error: interrupt SIGINT: 2
 My Rank= 7
p7_25018: (21294.568454) Trying to receive a message when there are no connections; Bailing out
 My Rank= 4
p4_25039: (21295.231037) Trying to receive a message when there are no connections; Bailing out
bm_list_28655: p4_error: interrupt SIGINT: 2
$_
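
The errno numbers in the log above can be decoded with a small helper. This is
a minimal sketch: the table holds only the two values that appear in the log,
with their meaning on Linux x86 (other systems number them differently), and
the names `LINUX_ERRNO` and `decode_errno` are made up for illustration.

```python
import re

# Linux x86 errno values; this numbering is not portable to other systems
LINUX_ERRNO = {
    104: ("ECONNRESET", "Connection reset by peer"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# Matches both "errno = 110" and "errno = : 104" as they appear in the log
ERRNO_RE = re.compile(r"errno\s*=\s*:?\s*(\d+)")

def decode_errno(line):
    """Return (number, name, text) for a log line, or None if it has no errno."""
    match = ERRNO_RE.search(line)
    if match is None:
        return None
    number = int(match.group(1))
    name, text = LINUX_ERRNO.get(number, ("UNKNOWN", "not in table"))
    return number, name, text

log_lines = [
    "p3_25675: (19533.719481) net_send: could not write to fd=6, errno = 110",
    "p2_25535: p4_error: net_recv read, errno = : 104",
    "p0_28654: p4_error: interrupt SIGINT: 2",
]
for line in log_lines:
    print(decode_errno(line))
```

Read this way, the log shows rank 3's write failing with a timeout (110) and
rank 2's read failing with a connection reset (104). The SIGINT lines that
follow look like the remaining processes being shut down afterwards, which is
an interpretation rather than something the log states.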

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Mon May 15 2000 - 21:00:17 EST