
Re: [ptp-dev] Problems starting SDM processes

Greg
I was able to put debugging code into sdm_connect_to_child and 
sdm_tcpip_init to debug the first problem. The top level SDM is iterating 
over port numbers, attempting to connect to the child SDMs, and getting a 
'connection refused' error on each one. Eventually it runs out of ports 
and exits. I think this is happening because I start the top SDM too soon 
after starting the poe process that creates the child SDMs. If I change my 
proxy to wait 5 seconds before starting the top SDM, the problem goes 
away. Without too much trouble, I can restructure my proxy so that the top 
SDM is started only after all the child SDMs are running (by waiting until 
my attach.cfg file is created), if you think that's reasonable.
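
Roughly, I have in mind polling for the file from the proxy before 
launching the top SDM, along these lines (just a sketch; the path and 
timeout are placeholders, not my actual proxy code):

#include <sys/stat.h>
#include <unistd.h>

/* Poll once a second until PE's attach.cfg appears, for up to
 * 'timeout' seconds. Returns 0 when the file exists, -1 if we
 * gave up waiting. */
static int wait_for_attach_cfg(const char *path, int timeout)
{
    struct stat sb;
    int waited;

    for (waited = 0; waited < timeout; waited++) {
        if (stat(path, &sb) == 0)
            return 0;   /* file exists, so the child SDMs are up */
        sleep(1);
    }
    return -1;
}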

I tried putting debug code into the event loops to see why the SDMs were 
apparently shutting down prematurely, but I either couldn't get the debug 
code to work or didn't understand enough of the main event loop logic to 
put it in the right place.

Can you look at this if you get a chance? I'll send you a separate email 
with details on how to get to my system.

Thanks
Dave



Greg Watson <g.watson@xxxxxxxxxxxx> 
Sent by: ptp-dev-bounces@xxxxxxxxxxx
08/29/2008 07:52 AM
Please respond to
Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>


To
Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>
cc

Subject
Re: [ptp-dev] Problems starting SDM processes

Dave,

Can you add some debugging to sdm_connect_to_child and sdm_tcpip_init to 
see what's happening? Or, if it would be easier, I could take a look on 
your machine.
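
Something as simple as wrapping the connect call with a couple of 
fprintf's would tell us a lot, e.g. (a sketch only; the names here are 
illustrative, not the actual SDM code):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Trace each connection attempt, and the errno on failure. */
static int traced_connect(int sockfd, const struct sockaddr *addr,
                          socklen_t len, const char *host, int port)
{
    fprintf(stderr, "sdm: connecting to %s:%d\n", host, port);
    if (connect(sockfd, addr, len) < 0) {
        fprintf(stderr, "sdm: connect to %s:%d failed: %s\n",
                host, port, strerror(errno));
        return -1;
    }
    fprintf(stderr, "sdm: connected to %s:%d\n", host, port);
    return 0;
}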

Greg

On Aug 29, 2008, at 7:11 AM, Dave Wootton wrote:


I made a little more progress on getting the SDM processes started with my 
proxy and a 4 task PE application. My proxy invokes the child SDMs first, 
via 'poe sdm ...', then almost immediately invokes the top level SDM. This 
was failing for me, with the top level SDM issuing the following messages:

debug: waiting for connect 
PE@k17sf2p03 (RDT): effsize: 5, size: 4, rv: 0 
PE@k17sf2p03 (RDT): nodeID: 2, hostname: k17sf2p03, port: 17257 
PE@k17sf2p03 (RDT): nodeID: 1, hostname: k17sf2p03, port: 16058 
PE@k17sf2p03 (RDT): SDM[2]: [4] No port found for the sdm child. hostname: k17sf2p03 
PE@k17sf2p03 (RDT): SDM[1]: sdm_init failed 

I modified my proxy to sleep 5 seconds before starting the top level SDM 
and got a little further. So I think that where the top SDM tries to 
connect to the child processes, it may need to retry the connection for 
some length of time, or otherwise deal with the fact that the child SDMs 
might not yet be running when the top SDM expects to find them. I can 
probably also handle this in my proxy with some restructuring, by waiting 
until the PE attach.cfg file has been created before starting the top SDM, 
since once that file is created I am guaranteed that the SDM processes 
have all been started. 
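
The kind of retry I'm imagining looks something like this (a sketch only; 
the retry count and interval are guesses, and the names don't match the 
real SDM code). A fresh socket is created per attempt, since a failed 
connect() leaves the old one in an unspecified state:

#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Retry a refused connection so the child SDM has time to start
 * listening. Returns a connected socket, or -1 on failure. */
static int connect_with_retry(const struct sockaddr *addr,
                              socklen_t len, int max_tries)
{
    int try, sockfd, saved;

    for (try = 0; try < max_tries; try++) {
        sockfd = socket(AF_INET, SOCK_STREAM, 0);
        if (sockfd < 0)
            return -1;
        if (connect(sockfd, addr, len) == 0)
            return sockfd;          /* connected */
        saved = errno;
        close(sockfd);
        if (saved != ECONNREFUSED)
            return -1;              /* some other error; don't retry */
        sleep(1);                   /* child not listening yet */
    }
    return -1;                      /* ran out of attempts */
}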

I think this is also a problem in the child SDM case, since I had one 
failure where I saw the same 'no port' message, probably because only some 
of the child SDMs were started when the top SDM tried to begin connecting 
the SDMs together in a tree. 

Once I changed my proxy to sleep 5 seconds, I got a little further. Now I 
get the following messages from the top SDM:

debug: waiting for connect 
PE@k17sf2p03 (RDT): SDM[2]: [0] size 4 
PE@k17sf2p03 (RDT): SDM[2]: [4] route for 0 is {2} 
PE@k17sf2p03 (RDT): SDM[2]: [4] route for 1 is {} 
PE@k17sf2p03 (RDT): SDM[2]: [4] route for 3 is {} 
PE@k17sf2p03 (RDT): SDM[2]: [4] in sdm_create_sockd_map 
PE@k17sf2p03 (RDT): SDM[2]: [4] sdm_route_get_route dest {0-4}, parent 4 
PE@k17sf2p03 (RDT): SDM[2]: [4] adjacent nodes: {0-1,3} 
PE@k17sf2p03 (RDT): SDM[2]: [4] adding 0 to my map 
PE@k17sf2p03 (RDT): SDM[2]: [4] adding 1 to my map 
PE@k17sf2p03 (RDT): SDM[2]: [4] adding 3 to my map 
PE@k17sf2p03 (RDT): effsize: 5, size: 4, rv: 0 
PE@k17sf2p03 (RDT): nodeID: 2, hostname: k17sf2p03, port: 14604 
PE@k17sf2p03 (RDT): nodeID: 1, hostname: k17sf2p03, port: 17781 
PE@k17sf2p03 (RDT): nodeID: 0, hostname: k17sf2p03, port: 14313 
PE@k17sf2p03 (RDT): nodeID: 3, hostname: k17sf2p03, port: 18067 
PE@k17sf2p03 (RDT): SDM[2]: [4] Initialization successful 
PE@k17sf2p03 (RDT): SDM[1]: starting client 

Very shortly afterwards I get a popup: 'Master SDM control has encountered 
a problem. sdm master process finished with exit code 1.'

I think the GUI then tries to terminate the child SDMs (my poe process), 
since the very next message logged by my proxy is the following:

PACKET:[00000017PE@k17sf2p03 (RDT): 08/28 22:39:46 T(256) Trace: >>> terminate_job entered. (Line 1416) 

terminate_job is only called when my proxy receives a request from the GUI 
to kill a job, so I'm guessing that the GUI detected that the top SDM 
exited and is attempting cleanup. 

The log where I capture messages from the child SDMs has the following 
messages:
08/28 22:39:38 T(256) Trace: <<< setup_child_stdio exited. (Line 3250) 
08/28 22:39:38 T(256) Trace: >>> setup_child_stdio entered. (Line 3235) 
08/28 22:39:38 T(256) Trace: <<< setup_child_stdio exited. (Line 3250) 
08/28 22:39:38 T(256) Trace: Target env[0]: MP_LABELIO=yes 
08/28 22:39:38 T(256) Trace: Target env[1]: MP_PROCS=4 
08/28 22:39:38 T(256) Trace: Target env[2]: MP_HOSTFILE=/home/wootton/hostfile.rh 
08/28 22:39:38 T(256) Trace: Target env[3]: MP_BUFFER_MEM=64M 
08/28 22:39:38 T(256) Trace: Target env[4]: MP_RESD=no 
08/28 22:39:38 T(256) Trace: Target arg[0]: poe 
08/28 22:39:38 T(256) Trace: Target arg[1]: /home/wootton/ptp/org.eclipse.ptp.debug.sdm/sdm 
08/28 22:39:38 T(256) Trace: Target arg[2]: --debug 
08/28 22:39:38 T(256) Trace: Target arg[3]: --debugger=gdb-mi 
08/28 22:39:38 T(256) Trace: Target arg[4]: --numprocs=4 
08/28 22:39:38 T(256) Trace: +++ Ready to invoke child process 
   0:SDM[2]: [0] size 4 
   0:SDM[2]: [0] route for 2 is {} 
   0:SDM[2]: [0] in sdm_create_sockd_map 
   0:SDM[2]: [0] sdm_route_get_route dest {0-3}, parent 4 
   0:SDM[2]: [0] adjacent nodes: {2} 
   0:SDM[2]: [0] adding 2 to my map 
   1:SDM[2]: [0] size 4 
   1:SDM[2]: [1] in sdm_create_sockd_map 
   1:SDM[2]: [1] sdm_route_get_route dest {0-3}, parent 4 
   1:SDM[2]: [1] adjacent nodes: {} 
   2:SDM[2]: [0] size 4 
   2:SDM[2]: [2] in sdm_create_sockd_map 
   2:SDM[2]: [2] sdm_route_get_route dest {1-3}, parent 0 
   2:SDM[2]: [2] adjacent nodes: {} 
   3:SDM[2]: [0] size 4 
   3:SDM[2]: [3] in sdm_create_sockd_map 
   3:SDM[2]: [3] sdm_route_get_route dest {0-3}, parent 4 
   3:SDM[2]: [3] adjacent nodes: {} 
   0:effsize: 5, size: 4, rv: 0 
   0:nodeID: 2, hostname: k17sf2p03, port: 14604 
   0:nodeID: 1, hostname: k17sf2p03, port: 17781 
   0:nodeID: 0, hostname: k17sf2p03, port: 14313 
   0:nodeID: 2, hostname: k17sf2p03, port: 14604 
   0:nodeID: 1, hostname: k17sf2p03, port: 17781 
   0:nodeID: 0, hostname: k17sf2p03, port: 14313 
   0:nodeID: 3, hostname: k17sf2p03, port: 18067 
   0:SDM[2]: [0] Initialization successful 
   0:SDM[1]: starting task 0 
   0:SDM[4]: starting server on [0,5] 
   1:effsize: 5, size: 4, rv: 0 
   1:nodeID: 2, hostname: k17sf2p03, port: 14604 
   1:nodeID: 1, hostname: k17sf2p03, port: 17781 
   1:SDM[2]: [1] Initialization successful 
   1:SDM[1]: starting task 1 
   1:SDM[4]: starting server on [1,5] 
   2:effsize: 5, size: 4, rv: 0 
   2:nodeID: 2, hostname: k17sf2p03, port: 14604 
   2:SDM[2]: [2] Initialization successful 
   2:SDM[1]: starting task 2 
   2:SDM[4]: starting server on [2,5] 
   3:effsize: 5, size: 4, rv: 0 
   3:nodeID: 2, hostname: k17sf2p03, port: 14604 
   3:nodeID: 1, hostname: k17sf2p03, port: 17781 
   3:nodeID: 0, hostname: k17sf2p03, port: 14313 
   3:nodeID: 3, hostname: k17sf2p03, port: 18067 
   3:SDM[2]: [3] Initialization successful 
   3:SDM[1]: starting task 3 
   3:SDM[4]: starting server on [3,5] 

So it looks like the child SDMs are not connecting into a tree for some 
reason. 
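
One thing that might narrow it down is having each SDM print who it thinks 
its parent and children are before any connecting starts, e.g. (sketch 
only; 'parent' and 'children' stand in for whatever the routing code 
actually computes, this is not the real SDM interface):

#include <stdio.h>

/* Dump this node's expected position in the tree before wiring it. */
static void dump_tree_position(int self, int parent,
                               const int *children, int nchildren)
{
    int i;

    fprintf(stderr, "SDM %d: parent=%d children={", self, parent);
    for (i = 0; i < nchildren; i++)
        fprintf(stderr, "%s%d", i > 0 ? "," : "", children[i]);
    fprintf(stderr, "}\n");
}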

Note: in the child SDM messages, the '0:', '1:', '2:', and '3:' prefixes 
are the MPI task ranks of the SDM processes. There's an option I turned on 
for PE (MP_LABELIO=yes) so that each line of output is labeled with the 
task index; in this case it highlights each task's SDM processing. 


Dave
_______________________________________________
ptp-dev mailing list
ptp-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/ptp-dev



