
[ptp-dev] Problems starting SDM processes


I made a little more progress on getting the SDM processes started with my proxy and a 4-task PE application. My proxy invokes the child SDMs first, via 'poe sdm ...', then almost immediately invokes the top-level SDM. This was failing for me, with the top-level SDM issuing the following messages:

debug: waiting for connect
PE@k17sf2p03 (RDT): effsize: 5, size: 4, rv: 0
PE@k17sf2p03 (RDT): nodeID: 2, hostname: k17sf2p03, port: 17257
PE@k17sf2p03 (RDT): nodeID: 1, hostname: k17sf2p03, port: 16058
PE@k17sf2p03 (RDT): SDM[2]: [4] No port found for the sdm child. hostname: k17sf2p03
PE@k17sf2p03 (RDT): SDM[1]: sdm_init failed
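
For context, the launch sequence in my proxy is essentially the following. This is a hypothetical sketch, not my actual proxy code; the sdm arguments are the ones from the trace later in this mail, and I've omitted the master-mode arguments. The point is that nothing synchronizes the two steps:

/*
 * Hypothetical sketch of the launch ordering, not the real proxy code.
 * The sdm path and arguments are taken from the trace further down;
 * the master-mode arguments are omitted.
 */
#include <unistd.h>

int main(void)
{
    /* Step 1: poe fans the sdm binary out to the 4 child tasks. */
    if (fork() == 0) {
        execlp("poe", "poe",
               "/home/wootton/ptp/org.eclipse.ptp.debug.sdm/sdm",
               "--debug", "--debugger=gdb-mi", "--numprocs=4",
               (char *)0);
        _exit(1);
    }
    /* Step 2: the top-level SDM is launched almost immediately.
     * If the children are not yet listening on their ports, this is
     * where "No port found for the sdm child" shows up. */
    if (fork() == 0) {
        execl("/home/wootton/ptp/org.eclipse.ptp.debug.sdm/sdm",
              "sdm", /* master-mode arguments omitted */
              (char *)0);
        _exit(1);
    }
    return 0;
}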

I modified my proxy to sleep 5 seconds before starting the top-level SDM and got a little further. So I think that where the SDM tries to connect to its child processes, it may need to retry the connection for some length of time, or otherwise deal with the fact that the child SDMs might not be running yet when the top SDM expects to find them. I can probably also handle this in my proxy, with some restructuring, by waiting until the PE attach.cfg file has been created before starting the top SDM: once that file exists, I am guaranteed that all of the SDM processes have been started.
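
A minimal sketch of the kind of retry I have in mind, as a hypothetical helper rather than the actual sdm_init code, assuming a numeric address for brevity:

/*
 * Hypothetical helper, not the actual sdm_init code: try to connect
 * to a child SDM at ip:port, retrying for up to timeout_secs seconds
 * instead of failing on the first attempt.
 */
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int connect_with_retry(const char *ip, int port, int timeout_secs)
{
    struct sockaddr_in addr;
    int elapsed;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return -1;                /* real code would resolve hostnames */

    for (elapsed = 0; elapsed < timeout_secs; elapsed++) {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0)
            return -1;
        if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            return sock;          /* connected */
        int err = errno;
        close(sock);
        if (err != ECONNREFUSED && err != ETIMEDOUT)
            return -1;            /* don't retry unrelated errors */
        sleep(1);                 /* child may not be listening yet */
    }
    return -1;                    /* gave up after timeout_secs tries */
}

Bounding the retries keeps a genuinely dead child from hanging the master forever, while still giving a slow-starting child time to open its port.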

I think this is also a problem in the child SDM case, since I had one failure where I saw the same message about no port, probably because only some of the child SDMs had started when the top SDM tried to begin connecting the SDMs together into a tree.
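
For the attach.cfg approach I mentioned above, the proxy-side wait is simple. A minimal sketch, where the path and the one-second poll interval are assumptions, not my actual proxy code:

/*
 * Hypothetical proxy-side workaround: poll until PE has written the
 * attach.cfg file before launching the top-level SDM.
 */
#include <unistd.h>
#include <sys/stat.h>

/* Return 0 once 'path' exists and is non-empty, -1 on timeout. */
int wait_for_attach_cfg(const char *path, int timeout_secs)
{
    struct stat st;
    int elapsed;

    for (elapsed = 0; elapsed < timeout_secs; elapsed++) {
        if (stat(path, &st) == 0 && st.st_size > 0)
            return 0;             /* PE has written the file */
        sleep(1);
    }
    return -1;
}

That would replace the fixed 5-second sleep with a real readiness check, since once attach.cfg exists all of the SDM tasks are guaranteed to have been started.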

Once I changed my proxy to sleep 5 seconds, I got a little further. Now I get the following messages from the top SDM:

debug: waiting for connect
PE@k17sf2p03 (RDT): SDM[2]: [0] size 4
PE@k17sf2p03 (RDT): SDM[2]: [4] route for 0 is {2}
PE@k17sf2p03 (RDT): SDM[2]: [4] route for 1 is {}
PE@k17sf2p03 (RDT): SDM[2]: [4] route for 3 is {}
PE@k17sf2p03 (RDT): SDM[2]: [4] in sdm_create_sockd_map
PE@k17sf2p03 (RDT): SDM[2]: [4] sdm_route_get_route dest {0-4}, parent 4
PE@k17sf2p03 (RDT): SDM[2]: [4] adjacent nodes: {0-1,3}
PE@k17sf2p03 (RDT): SDM[2]: [4] adding 0 to my map
PE@k17sf2p03 (RDT): SDM[2]: [4] adding 1 to my map
PE@k17sf2p03 (RDT): SDM[2]: [4] adding 3 to my map
PE@k17sf2p03 (RDT): effsize: 5, size: 4, rv: 0
PE@k17sf2p03 (RDT): nodeID: 2, hostname: k17sf2p03, port: 14604
PE@k17sf2p03 (RDT): nodeID: 1, hostname: k17sf2p03, port: 17781
PE@k17sf2p03 (RDT): nodeID: 0, hostname: k17sf2p03, port: 14313
PE@k17sf2p03 (RDT): nodeID: 3, hostname: k17sf2p03, port: 18067
PE@k17sf2p03 (RDT): SDM[2]: [4] Initialization successful
PE@k17sf2p03 (RDT): SDM[1]: starting client

Very shortly afterwards I get a popup: 'Master SDM control has encountered a problem. sdm master process finished with exit code 1.'

I think the GUI then tries to terminate the child SDMs (my poe process), since the very next message logged by my proxy is the following:

PACKET:[00000017PE@k17sf2p03 (RDT): 08/28 22:39:46 T(256) Trace: >>> terminate_job entered. (Line 1416)

terminate_job is only called when my proxy receives a request from the GUI to kill a job, so I'm guessing that the GUI detected that the top SDM exited and is attempting cleanup.

The log where I capture messages from the child SDMs contains the following messages:
08/28 22:39:38 T(256) Trace: <<< setup_child_stdio exited. (Line 3250)
08/28 22:39:38 T(256) Trace: >>> setup_child_stdio entered. (Line 3235)
08/28 22:39:38 T(256) Trace: <<< setup_child_stdio exited. (Line 3250)
08/28 22:39:38 T(256) Trace: Target env[0]: MP_LABELIO=yes
08/28 22:39:38 T(256) Trace: Target env[1]: MP_PROCS=4
08/28 22:39:38 T(256) Trace: Target env[2]: MP_HOSTFILE=/home/wootton/hostfile.rh
08/28 22:39:38 T(256) Trace: Target env[3]: MP_BUFFER_MEM=64M
08/28 22:39:38 T(256) Trace: Target env[4]: MP_RESD=no
08/28 22:39:38 T(256) Trace: Target arg[0]: poe
08/28 22:39:38 T(256) Trace: Target arg[1]: /home/wootton/ptp/org.eclipse.ptp.debug.sdm/sdm
08/28 22:39:38 T(256) Trace: Target arg[2]: --debug
08/28 22:39:38 T(256) Trace: Target arg[3]: --debugger=gdb-mi
08/28 22:39:38 T(256) Trace: Target arg[4]: --numprocs=4
08/28 22:39:38 T(256) Trace: +++ Ready to invoke child process
   0:SDM[2]: [0] size 4
   0:SDM[2]: [0] route for 2 is {}
   0:SDM[2]: [0] in sdm_create_sockd_map
   0:SDM[2]: [0] sdm_route_get_route dest {0-3}, parent 4
   0:SDM[2]: [0] adjacent nodes: {2}
   0:SDM[2]: [0] adding 2 to my map
   1:SDM[2]: [0] size 4
   1:SDM[2]: [1] in sdm_create_sockd_map
   1:SDM[2]: [1] sdm_route_get_route dest {0-3}, parent 4
   1:SDM[2]: [1] adjacent nodes: {}
   2:SDM[2]: [0] size 4
   2:SDM[2]: [2] in sdm_create_sockd_map
   2:SDM[2]: [2] sdm_route_get_route dest {1-3}, parent 0
   2:SDM[2]: [2] adjacent nodes: {}
   3:SDM[2]: [0] size 4
   3:SDM[2]: [3] in sdm_create_sockd_map
   3:SDM[2]: [3] sdm_route_get_route dest {0-3}, parent 4
   3:SDM[2]: [3] adjacent nodes: {}
   0:effsize: 5, size: 4, rv: 0
   0:nodeID: 2, hostname: k17sf2p03, port: 14604
   0:nodeID: 1, hostname: k17sf2p03, port: 17781
   0:nodeID: 0, hostname: k17sf2p03, port: 14313
   0:nodeID: 2, hostname: k17sf2p03, port: 14604
   0:nodeID: 1, hostname: k17sf2p03, port: 17781
   0:nodeID: 0, hostname: k17sf2p03, port: 14313
   0:nodeID: 3, hostname: k17sf2p03, port: 18067
   0:SDM[2]: [0] Initialization successful
   0:SDM[1]: starting task 0
   0:SDM[4]: starting server on [0,5]
   1:effsize: 5, size: 4, rv: 0
   1:nodeID: 2, hostname: k17sf2p03, port: 14604
   1:nodeID: 1, hostname: k17sf2p03, port: 17781
   1:SDM[2]: [1] Initialization successful
   1:SDM[1]: starting task 1
   1:SDM[4]: starting server on [1,5]
   2:effsize: 5, size: 4, rv: 0
   2:nodeID: 2, hostname: k17sf2p03, port: 14604
   2:SDM[2]: [2] Initialization successful
   2:SDM[1]: starting task 2
   2:SDM[4]: starting server on [2,5]
   3:effsize: 5, size: 4, rv: 0
   3:nodeID: 2, hostname: k17sf2p03, port: 14604
   3:nodeID: 1, hostname: k17sf2p03, port: 17781
   3:nodeID: 0, hostname: k17sf2p03, port: 14313
   3:nodeID: 3, hostname: k17sf2p03, port: 18067
   3:SDM[2]: [3] Initialization successful
   3:SDM[1]: starting task 3
   3:SDM[4]: starting server on [3,5]

So it looks like the child SDMs are not connecting into a tree for some reason.

Note: In the child SDM messages, the '0:', '1:', '2:', and '3:' prefixes are the MPI task ranks of the SDM processes. There's a PE option I turned on (MP_LABELIO=yes, shown in the environment dump above) so that each line of output is labeled with its task rank, which makes it easy to follow each task's SDM processing.


Dave
