Re: [ptp-dev] Problems starting SDM processes

Dave,

I'm pretty sure the problem is that you're not setting the jobNumProcs attribute on the newly created job. It looks like the debugger uses this to work out how many processes are being debugged. I've modified the proxy routines so that you have to specify the number of processes when you create a new job so this attribute is automatically created, but haven't been able to get it to work with your proxy yet since you've changed it quite a bit from the CVS version. I'll keep working on it and let you know when I get it going.
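As an aside, since PTP proxy attributes travel as "key=value" strings, a proxy could attach the process count to a new-job event along these lines. This is only an illustrative sketch; format_job_numprocs_attr is a hypothetical helper name, not part of the actual proxy API.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format the jobNumProcs attribute string that a
 * proxy would include among the attributes of a newly created job. */
int format_job_numprocs_attr(char *buf, size_t buflen, int nprocs)
{
    int n = snprintf(buf, buflen, "jobNumProcs=%d", nprocs);
    /* Return -1 on error or truncation so the caller can detect it. */
    return (n > 0 && (size_t)n < buflen) ? 0 : -1;
}
```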

Greg

On Sep 2, 2008, at 9:03 AM, Dave Wootton wrote:

Greg
I was able to put debugging code into the sdm_connect_to_child and
sdm_tcp_init to debug the first problem. The top level SDM is iterating
over port numbers attempting to connect to child SDMs and getting a
'connection refused' error. Eventually it runs out of ports and exits. I
think this is happening because I start the top SDM too soon after
starting the poe process to create the child SDMs. If I change my proxy to wait 5 seconds before starting the top SDM, then this problem is solved. I can change my proxy so that the top proxy is started after all the child SDMs are started (by waiting until my attach.cfg file is created) without
too much trouble if you think that's reasonable.

I tried putting debug code into the event loops to see why the SDMs were apparently shutting down prematurely, but I either couldn't get the debug code to work or just didn't understand enough of the main event loop logic
to put the debug code in the right place.

Can you look at this if you get a chance? I'll send you a separate email
with details how to get to my system.

Thanks
Dave



Greg Watson <g.watson@xxxxxxxxxxxx>
Sent by: ptp-dev-bounces@xxxxxxxxxxx
08/29/2008 07:52 AM
Please respond to
Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>


To
Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>
cc

Subject
Re: [ptp-dev] Problems starting SDM processes


Dave,

Can you add some debugging to sdm_connect_to_child and sdm_tcpip_init to
see what's happening? Or, if it would be easier, I could take a look on
your machine.

Greg

On Aug 29, 2008, at 7:11 AM, Dave Wootton wrote:


I made a little more progress on getting the SDM processes started with my proxy and a 4 task PE application. My proxy invokes the child SDMs first, via 'poe sdm ...', then almost immediately invokes the top level SDM. This was failing for me with the top level SDM issuing the following messages:

debug: waiting for connect
PE@k17sf2p03 (RDT): effsize: 5, size: 4, rv: 0
PE@k17sf2p03 (RDT): nodeID: 2, hostname: k17sf2p03, port: 17257
PE@k17sf2p03 (RDT): nodeID: 1, hostname: k17sf2p03, port: 16058
PE@k17sf2p03 (RDT): SDM[2]: [4] No port found for the sdm child. hostname:
k17sf2p03
PE@k17sf2p03 (RDT): SDM[1]: sdm_init failed

I modified my proxy to sleep 5 seconds before starting the top level SDM and got a little further. So I think that where the SDM tries to connect to child processes, it might need to retry a connection for some length of
time or otherwise deal with the fact that the child SDMs might not be
running at the time the top SDM expects to find them. I can probably also handle this in my proxy with some restructuring by waiting until after the
PE attach.cfg file has been created before starting the top SDM, since
once that file is created, I am guaranteed that the SDM processes have all
been started.
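The retry idea could look something like the wrapper below. This is a hedged sketch only: try_connect_fn and retry_connect are hypothetical names standing in for whatever sdm_connect_to_child actually does, and the real fix would live inside the SDM runtime.

```c
#include <unistd.h>

/* A connection attempt callback: returns 0 on success, nonzero on
 * failure (e.g. ECONNREFUSED because the child SDM isn't up yet). */
typedef int (*try_connect_fn)(void *ctx);

/* Retry a connection a bounded number of times with a fixed delay,
 * instead of failing permanently on the first refused connect. */
int retry_connect(try_connect_fn try_connect, void *ctx,
                  int max_attempts, unsigned delay_ms)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (try_connect(ctx) == 0)
            return 0;                /* child accepted the connection */
        usleep(delay_ms * 1000);     /* child may not be running yet */
    }
    return -1;                       /* give up after max_attempts */
}

/* Demo stub simulating a slow-starting child: fails twice, then
 * succeeds on the third attempt. */
static int demo_attempts;
static int demo_try(void *ctx)
{
    (void)ctx;
    return (++demo_attempts >= 3) ? 0 : -1;
}
```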

I think this is also a problem in the child SDM case since I had one
failure where I saw the same message about no port, probably because only some of the SDMs were started when the top SDM tried to start the process
of connecting the SDMs together in a tree.

Once I changed my proxy to sleep 5 seconds, I got a little farther. Now I
get the following messages from the top SDM:

debug: waiting for connect
PE@k17sf2p03 (RDT): SDM[2]: [0] size 4
PE@k17sf2p03 (RDT): SDM[2]: [4] route for 0 is {2}
PE@k17sf2p03 (RDT): SDM[2]: [4] route for 1 is {}
PE@k17sf2p03 (RDT): SDM[2]: [4] route for 3 is {}
PE@k17sf2p03 (RDT): SDM[2]: [4] in sdm_create_sockd_map
PE@k17sf2p03 (RDT): SDM[2]: [4] sdm_route_get_route dest {0-4}, parent 4
PE@k17sf2p03 (RDT): SDM[2]: [4] adjacent nodes: {0-1,3}
PE@k17sf2p03 (RDT): SDM[2]: [4] adding 0 to my map
PE@k17sf2p03 (RDT): SDM[2]: [4] adding 1 to my map
PE@k17sf2p03 (RDT): SDM[2]: [4] adding 3 to my map
PE@k17sf2p03 (RDT): effsize: 5, size: 4, rv: 0
PE@k17sf2p03 (RDT): nodeID: 2, hostname: k17sf2p03, port: 14604
PE@k17sf2p03 (RDT): nodeID: 1, hostname: k17sf2p03, port: 17781
PE@k17sf2p03 (RDT): nodeID: 0, hostname: k17sf2p03, port: 14313
PE@k17sf2p03 (RDT): nodeID: 3, hostname: k17sf2p03, port: 18067
PE@k17sf2p03 (RDT): SDM[2]: [4] Initialization successful
PE@k17sf2p03 (RDT): SDM[1]: starting client

Very shortly afterwards I get a popup 'Master SDM control has encountered
a problem. sdm master process finished with exit code 1.'

I think the GUI then tries to terminate the child SDMs (my poe process)
since the very next message logged by my proxy is the following

PACKET:[00000017PE@k17sf2p03 (RDT): 08/28 22:39:46 T(256) Trace: >>>
terminate_job entered. (Line 1416)

terminate_job is only called when my proxy receives a request from the GUI to kill a job. So I'm guessing that the GUI detected the top SDM exited
and is attempting cleanup.

The log where I capture messages from the child SDMs has the following
messages
08/28 22:39:38 T(256) Trace: <<< setup_child_stdio exited. (Line 3250)
08/28 22:39:38 T(256) Trace: >>> setup_child_stdio entered. (Line 3235)
08/28 22:39:38 T(256) Trace: <<< setup_child_stdio exited. (Line 3250)
08/28 22:39:38 T(256) Trace: Target env[0]: MP_LABELIO=yes
08/28 22:39:38 T(256) Trace: Target env[1]: MP_PROCS=4
08/28 22:39:38 T(256) Trace: Target env[2]:
MP_HOSTFILE=/home/wootton/hostfile.rh
08/28 22:39:38 T(256) Trace: Target env[3]: MP_BUFFER_MEM=64M
08/28 22:39:38 T(256) Trace: Target env[4]: MP_RESD=no
08/28 22:39:38 T(256) Trace: Target arg[0]: poe
08/28 22:39:38 T(256) Trace: Target arg[1]:
/home/wootton/ptp/org.eclipse.ptp.debug.sdm/sdm
08/28 22:39:38 T(256) Trace: Target arg[2]: --debug
08/28 22:39:38 T(256) Trace: Target arg[3]: --debugger=gdb-mi
08/28 22:39:38 T(256) Trace: Target arg[4]: --numprocs=4
08/28 22:39:38 T(256) Trace: +++ Ready to invoke child process
  0:SDM[2]: [0] size 4
  0:SDM[2]: [0] route for 2 is {}
  0:SDM[2]: [0] in sdm_create_sockd_map
  0:SDM[2]: [0] sdm_route_get_route dest {0-3}, parent 4
  0:SDM[2]: [0] adjacent nodes: {2}
  0:SDM[2]: [0] adding 2 to my map
  1:SDM[2]: [0] size 4
  1:SDM[2]: [1] in sdm_create_sockd_map
  1:SDM[2]: [1] sdm_route_get_route dest {0-3}, parent 4
  1:SDM[2]: [1] adjacent nodes: {}
  2:SDM[2]: [0] size 4
  2:SDM[2]: [2] in sdm_create_sockd_map
  2:SDM[2]: [2] sdm_route_get_route dest {1-3}, parent 0
  2:SDM[2]: [2] adjacent nodes: {}
  3:SDM[2]: [0] size 4
  3:SDM[2]: [3] in sdm_create_sockd_map
  3:SDM[2]: [3] sdm_route_get_route dest {0-3}, parent 4
  3:SDM[2]: [3] adjacent nodes: {}
  0:effsize: 5, size: 4, rv: 0
  0:nodeID: 2, hostname: k17sf2p03, port: 14604
  0:nodeID: 1, hostname: k17sf2p03, port: 17781
  0:nodeID: 0, hostname: k17sf2p03, port: 14313
  0:nodeID: 2, hostname: k17sf2p03, port: 14604
  0:nodeID: 1, hostname: k17sf2p03, port: 17781
  0:nodeID: 0, hostname: k17sf2p03, port: 14313
  0:nodeID: 3, hostname: k17sf2p03, port: 18067
  0:SDM[2]: [0] Initialization successful
  0:SDM[1]: starting task 0
  0:SDM[4]: starting server on [0,5]
  1:effsize: 5, size: 4, rv: 0
  1:nodeID: 2, hostname: k17sf2p03, port: 14604
  1:nodeID: 1, hostname: k17sf2p03, port: 17781
  1:SDM[2]: [1] Initialization successful
  1:SDM[1]: starting task 1
  1:SDM[4]: starting server on [1,5]
  2:effsize: 5, size: 4, rv: 0
  2:nodeID: 2, hostname: k17sf2p03, port: 14604
  2:SDM[2]: [2] Initialization successful
  2:SDM[1]: starting task 2
  2:SDM[4]: starting server on [2,5]
  3:effsize: 5, size: 4, rv: 0
  3:nodeID: 2, hostname: k17sf2p03, port: 14604
  3:nodeID: 1, hostname: k17sf2p03, port: 17781
  3:nodeID: 0, hostname: k17sf2p03, port: 14313
  3:nodeID: 3, hostname: k17sf2p03, port: 18067
  3:SDM[2]: [3] Initialization successful
  3:SDM[1]: starting task 3
  3:SDM[4]: starting server on [3,5]

So it looks like the child SDMs are not connecting into a tree for some
reason.

Note: In the child SDM messages, the '0:', '1:', '2:', and '3:' prefixes are the MPI task rank of each of the SDM processes. There's an option I turned on for PE so
that each line of output is labeled with its task index. In this case it
highlights each task's SDM processing.


Dave

_______________________________________________
ptp-dev mailing list
ptp-dev@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/ptp-dev

