Greg
I was able to put debugging code into sdm_connect_to_child and
sdm_tcpip_init to debug the first problem. The top level SDM is iterating
over port numbers attempting to connect to child SDMs and getting a
'connect refused' message. Eventually it runs out of ports and exits. I
think this is happening because I start the top SDM too soon after
starting the poe process to create the child SDMs. If I change my proxy
to wait 5 seconds before starting the top SDM, then this problem is
solved. I can change my proxy so that the top SDM is started after all
the child SDMs are started (by waiting until my attach.cfg file is
created) without too much trouble if you think that's reasonable.
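A minimal sketch of that proxy-side wait, in case it helps the discussion. It polls for a file and gives up after a deadline instead of sleeping a fixed 5 seconds. The function name wait_for_file, the timeout values, and the use of Python are assumptions for illustration only; the real proxy is not assumed to look like this.

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll_interval=0.1):
    """Poll until `path` exists.

    Returns True as soon as the file is present, or False if `timeout`
    seconds elapse first. `path` would be the attach.cfg location
    (hypothetical here; the actual location depends on the PE setup).
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

The proxy would start the top SDM only when this returns True, and report a startup failure otherwise, so a poe process that never comes up does not hang the launch forever.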
I tried putting debug code into the event loops to see why the SDMs were
apparently shutting down prematurely, but I either couldn't get the debug
code to work or just didn't understand enough of the main event loop
logic to put the debug code in the right place.
Can you look at this if you get a chance? I'll send you a separate email
with details on how to get to my system.
Thanks
Dave
Greg Watson <g.watson@xxxxxxxxxxxx>
Sent by: ptp-dev-bounces@xxxxxxxxxxx
08/29/2008 07:52 AM
Please respond to: Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>
To: Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>
cc:
Subject: Re: [ptp-dev] Problems starting SDM processes
Dave,
Can you add some debugging to sdm_connect_to_child and sdm_tcpip_init to
see what's happening? Or if it would be easier, I could take a look on
your machine.
Greg
On Aug 29, 2008, at 7:11 AM, Dave Wootton wrote:
I made a little more progress on getting the SDM processes started with
my proxy and a 4 task PE application. My proxy invokes the child SDMs
first, by 'poe SDM ...', then almost immediately invokes the top level
SDM. This was failing for me with the top level SDM issuing the following
messages:
debug: waiting for connect
PE@k17sf2p03 (RDT): effsize: 5, size: 4, rv: 0
PE@k17sf2p03 (RDT): nodeID: 2, hostname: k17sf2p03, port: 17257
PE@k17sf2p03 (RDT): nodeID: 1, hostname: k17sf2p03, port: 16058
PE@k17sf2p03 (RDT): SDM[2]: [4] No port found for the sdm child. hostname: k17sf2p03
PE@k17sf2p03 (RDT): SDM[1]: sdm_init failed
I modified my proxy to sleep 5 seconds before starting the top level SDM
and got a little further. So I think that where the SDM tries to connect
to child processes, it might need to retry a connection for some length
of time or otherwise deal with the fact that the child SDMs might not be
running at the time the top SDM expects to find them. I can probably also
handle this in my proxy with some restructuring by waiting until after
the PE attach.cfg file has been created before starting the top SDM,
since once that file is created, I am guaranteed that the SDM processes
have all been started.
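To make the retry idea concrete, here is a rough sketch of retrying a refused connection until a deadline. It is not the SDM code (which is C); the function name connect_with_retry and the timeout and delay values are made up for illustration, and Python is used only to keep it short.

```python
import socket
import time


def connect_with_retry(host, port, timeout=10.0, delay=0.25):
    """Retry a TCP connect while the peer refuses the connection.

    Returns a connected socket, or re-raises ConnectionRefusedError once
    `timeout` seconds have passed without success. Other errors (for
    example an unreachable host) are not retried and propagate at once.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return socket.create_connection((host, port), timeout=5.0)
        except ConnectionRefusedError:
            # The child is probably not listening yet; wait and try again.
            if time.monotonic() >= deadline:
                raise
            time.sleep(delay)
```

Retrying on the same port for a bounded time, rather than moving straight on to the next port, would tolerate a child SDM that starts a little late while still failing cleanly if it never starts.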
I think this is also a problem in the child SDM case since I had one
failure where I saw the same message about no port, probably because only
some of the SDMs were started when the top SDM tried to start the process
of connecting the SDMs together in a tree.
Once I changed my proxy to sleep 5 seconds, I got a little farther. Now I
get the following messages from the top SDM:
debug: waiting for connect
PE@k17sf2p03 (RDT): SDM[2]: [0] size 4
PE@k17sf2p03 (RDT): SDM[2]: [4] route for 0 is {2}
PE@k17sf2p03 (RDT): SDM[2]: [4] route for 1 is {}
PE@k17sf2p03 (RDT): SDM[2]: [4] route for 3 is {}
PE@k17sf2p03 (RDT): SDM[2]: [4] in sdm_create_sockd_map
PE@k17sf2p03 (RDT): SDM[2]: [4] sdm_route_get_route dest {0-4}, parent 4
PE@k17sf2p03 (RDT): SDM[2]: [4] adjacent nodes: {0-1,3}
PE@k17sf2p03 (RDT): SDM[2]: [4] adding 0 to my map
PE@k17sf2p03 (RDT): SDM[2]: [4] adding 1 to my map
PE@k17sf2p03 (RDT): SDM[2]: [4] adding 3 to my map
PE@k17sf2p03 (RDT): effsize: 5, size: 4, rv: 0
PE@k17sf2p03 (RDT): nodeID: 2, hostname: k17sf2p03, port: 14604
PE@k17sf2p03 (RDT): nodeID: 1, hostname: k17sf2p03, port: 17781
PE@k17sf2p03 (RDT): nodeID: 0, hostname: k17sf2p03, port: 14313
PE@k17sf2p03 (RDT): nodeID: 3, hostname: k17sf2p03, port: 18067
PE@k17sf2p03 (RDT): SDM[2]: [4] Initialization successful
PE@k17sf2p03 (RDT): SDM[1]: starting client
Very shortly afterwards I get a popup: 'Master SDM control has
encountered a problem. sdm master process finished with exit code 1.'
I think the GUI then tries to terminate the child SDMs (my poe process)
since the very next message logged by my proxy is the following:
PACKET:[00000017PE@k17sf2p03 (RDT): 08/28 22:39:46 T(256) Trace: >>> terminate_job entered. (Line 1416)
terminate_job is only called when my proxy receives a request from the
GUI to kill a job. So I'm guessing that the GUI detected that the top SDM
exited and is attempting cleanup.
The log where I capture messages from the child SDMs has the following
messages:
08/28 22:39:38 T(256) Trace: <<< setup_child_stdio exited. (Line 3250)
08/28 22:39:38 T(256) Trace: >>> setup_child_stdio entered. (Line 3235)
08/28 22:39:38 T(256) Trace: <<< setup_child_stdio exited. (Line 3250)
08/28 22:39:38 T(256) Trace: Target env[0]: MP_LABELIO=yes
08/28 22:39:38 T(256) Trace: Target env[1]: MP_PROCS=4
08/28 22:39:38 T(256) Trace: Target env[2]: MP_HOSTFILE=/home/wootton/hostfile.rh
08/28 22:39:38 T(256) Trace: Target env[3]: MP_BUFFER_MEM=64M
08/28 22:39:38 T(256) Trace: Target env[4]: MP_RESD=no
08/28 22:39:38 T(256) Trace: Target arg[0]: poe
08/28 22:39:38 T(256) Trace: Target arg[1]: /home/wootton/ptp/org.eclipse.ptp.debug.sdm/sdm
08/28 22:39:38 T(256) Trace: Target arg[2]: --debug
08/28 22:39:38 T(256) Trace: Target arg[3]: --debugger=gdb-mi
08/28 22:39:38 T(256) Trace: Target arg[4]: --numprocs=4
08/28 22:39:38 T(256) Trace: +++ Ready to invoke child process
0:SDM[2]: [0] size 4
0:SDM[2]: [0] route for 2 is {}
0:SDM[2]: [0] in sdm_create_sockd_map
0:SDM[2]: [0] sdm_route_get_route dest {0-3}, parent 4
0:SDM[2]: [0] adjacent nodes: {2}
0:SDM[2]: [0] adding 2 to my map
1:SDM[2]: [0] size 4
1:SDM[2]: [1] in sdm_create_sockd_map
1:SDM[2]: [1] sdm_route_get_route dest {0-3}, parent 4
1:SDM[2]: [1] adjacent nodes: {}
2:SDM[2]: [0] size 4
2:SDM[2]: [2] in sdm_create_sockd_map
2:SDM[2]: [2] sdm_route_get_route dest {1-3}, parent 0
2:SDM[2]: [2] adjacent nodes: {}
3:SDM[2]: [0] size 4
3:SDM[2]: [3] in sdm_create_sockd_map
3:SDM[2]: [3] sdm_route_get_route dest {0-3}, parent 4
3:SDM[2]: [3] adjacent nodes: {}
0:effsize: 5, size: 4, rv: 0
0:nodeID: 2, hostname: k17sf2p03, port: 14604
0:nodeID: 1, hostname: k17sf2p03, port: 17781
0:nodeID: 0, hostname: k17sf2p03, port: 14313
0:nodeID: 2, hostname: k17sf2p03, port: 14604
0:nodeID: 1, hostname: k17sf2p03, port: 17781
0:nodeID: 0, hostname: k17sf2p03, port: 14313
0:nodeID: 3, hostname: k17sf2p03, port: 18067
0:SDM[2]: [0] Initialization successful
0:SDM[1]: starting task 0
0:SDM[4]: starting server on [0,5]
1:effsize: 5, size: 4, rv: 0
1:nodeID: 2, hostname: k17sf2p03, port: 14604
1:nodeID: 1, hostname: k17sf2p03, port: 17781
1:SDM[2]: [1] Initialization successful
1:SDM[1]: starting task 1
1:SDM[4]: starting server on [1,5]
2:effsize: 5, size: 4, rv: 0
2:nodeID: 2, hostname: k17sf2p03, port: 14604
2:SDM[2]: [2] Initialization successful
2:SDM[1]: starting task 2
2:SDM[4]: starting server on [2,5]
3:effsize: 5, size: 4, rv: 0
3:nodeID: 2, hostname: k17sf2p03, port: 14604
3:nodeID: 1, hostname: k17sf2p03, port: 17781
3:nodeID: 0, hostname: k17sf2p03, port: 14313
3:nodeID: 3, hostname: k17sf2p03, port: 18067
3:SDM[2]: [3] Initialization successful
3:SDM[1]: starting task 3
3:SDM[4]: starting server on [3,5]
So it looks like the child SDMs are not connecting into a tree for some
reason.
Note: In the child SDM messages, the '0:', '1:', '2:', and '3:' prefixes
are the MPI task rank of each of the SDM processes. There's an option I
turned on for PE so that each line of output is labeled with task index.
In this case it highlights each task's SDM processing.
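Since the task labels interleave in the log, a small script like the sketch below could split the child SDM output per task for easier reading. It assumes each labeled line starts with the decimal task rank followed by a colon, as in the log above; the function name group_by_task is made up, and unlabeled lines (such as the timestamped proxy trace lines) are simply skipped.

```python
import re
from collections import defaultdict

# A labeled line looks like "<rank>:<message>", e.g. "2:SDM[1]: starting task 2".
LABEL = re.compile(r"^(\d+):(.*)$")


def group_by_task(lines):
    """Group labeled output lines by task rank, preserving line order."""
    groups = defaultdict(list)
    for line in lines:
        match = LABEL.match(line)
        if match:
            groups[int(match.group(1))].append(match.group(2))
    return dict(groups)
```

Reading the messages for one rank at a time makes it easier to compare what each child SDM saw, for example which nodeID and port entries each one received.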
Dave
_______________________________________________
ptp-dev mailing list
ptp-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/ptp-dev