
Re: [ptp-user] PTP debugging with SLURM (was mpich support in luna?)

Another thing I have to do with this setup: before I can run another debug job, I have to manually kill some leftover processes on the node I start my job from:


thomasge  17317  0.0  0.0 138924  2328 ?        S    09:32   0:00 sshd: thomasge@notty
thomasge  17631  0.0  0.0  69412  2760 ?        Ss   09:32   0:00 /usr/libexec/openssh/sftp-server

[thomasge@gcn2 ~]$ kill -9 17317 17631

Otherwise my job will not start under control of the sdm, but will terminate as soon as the sdm master and the sdm instances on the compute node(s) have connected:
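To save doing that by hand each time, the cleanup can be scripted. This is only a sketch: the two match patterns are assumptions based on the ps listing above, so check what is actually safe to kill on your own system.

```shell
#!/bin/sh
# leftover_pids reads a ps-style listing on stdin and prints the PIDs of the
# leftover debug-session helper processes seen above (an sshd forwarding
# process and its sftp-server child).
leftover_pids() {
    awk '/sshd: [a-z0-9]+@notty/ || /sftp-server/ { print $2 }'
}

# Usage (destructive, hence left commented out):
#   kill -9 $(ps aux | grep "^$USER" | leftover_pids)
```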

this pid: 20130
waiting: 20129 20130
#PTP job_id=0
#launchMode=debug
launchCommand: salloc --time=01:00:00 --partition=gpu_short --ntasks=2 --ntasks-per-node=1 mpirun -np 2 -mca orte_show_resolved_nodenames 1 -display-map /nfs/home1/thomasge/.eclipsesettings/sdm --port=39704 --host=localhost --debugger=gdb-mi --debug=13 --routing_file=/nfs/home1/thomasge/source/testmpitype/routing_file
line: salloc: Granted job allocation 1151231
salloc: Granted job allocation 1151231
line:  Data for JOB [32579,1] offset 0
 Data for JOB [32579,1] offset 0
line: 

line:  ========================   JOB MAP   ========================
found job map:  ========================   JOB MAP   ========================
found node gcn64, procs 1
found proc 0
found end of node map
found node gcn65, procs 1
found proc 1
found end of node map
found end of table
line: SDM: [server] effsize: 3, size: 2, rv: 0
SDM: [server] effsize: 3, size: 2, rv: 0
line: SDM: [server] Found routing file, size=2
SDM: [server] Found routing file, size=2
line: SDM: [1] size 3
SDM: [1] size 3
line: SDM: [1] sdm_route_get_route dest {0-1}, parent 2
SDM: [1] sdm_route_get_route dest {0-1}, parent 2
line: SDM: [server] effsize: 3, size: 2, rv: 0
SDM: [server] effsize: 3, size: 2, rv: 0
line: SDM: [1] nodeID: 0, hostname: gcn64, port: 59269
SDM: [1] nodeID: 0, hostname: gcn64, port: 59269
line: SDM: [1] nodeID: 1, hostname: gcn65, port: 55920
SDM: [1] nodeID: 1, hostname: gcn65, port: 55920
line: SDM: [server] effsize: 3, size: 2, rv: 0
SDM: [server] effsize: 3, size: 2, rv: 0
line: SDM: [server] Found routing file, size=2
SDM: [server] Found routing file, size=2
line: SDM: [0] size 3
SDM: [0] size 3
line: SDM: [0] sdm_route_get_route dest {0-1}, parent 2
SDM: [0] sdm_route_get_route dest {0-1}, parent 2
line: SDM: [server] effsize: 3, size: 2, rv: 0
SDM: [server] effsize: 3, size: 2, rv: 0
line: SDM: [0] nodeID: 0, hostname: gcn64, port: 59269
SDM: [0] nodeID: 0, hostname: gcn64, port: 59269
SDM: [master] effsize: 3, size: 2, rv: 0
SDM: [master] Found routing file, size=2
SDM: [2] size 3
SDM: [2] route for 0 is {} 
SDM: [2] route for 1 is {} 
SDM: [2] sdm_route_get_route dest {0-2}, parent 2
SDM: [master] effsize: 3, size: 2, rv: 0
SDM: [2] nodeID: 0, hostname: gcn64, port: 59269
SDM: [2] nodeID: 1, hostname: gcn65, port: 55920
line: SDM: [0] Initialization successful
SDM: [0] Initialization successful
line: SDM: starting task 0
SDM: starting task 0
SDM: [2] nodeID: 1, hostname: gcn65, port: 55920
SDM: [2] Initialization successful
SDM: starting client
SDM: DbgMasterInit num_svrs=2
SDM: DbgMasterCreateSession host=localhost port=39704
line: SDM: [1] Initialization successful
SDM: [1] Initialization successful
line: SDM: starting task 1
SDM: starting task 1
SDM: DbgMasterStartSession(testmpitype,/nfs/home1/thomasge/source/testmpitype,)
SDM: [2] sdm_route_get_route dest {0-1}, parent 2
line: SDM: [1] sdm_route_get_route dest {0}, parent 2
SDM: [1] sdm_route_get_route dest {0}, parent 2
line: SDM: [0] sdm_route_get_route dest {1}, parent 2
SDM: [0] sdm_route_get_route dest {1}, parent 2
line: SDM: [1] sdm_route_get_route dest {2}, parent 2
SDM: [1] sdm_route_get_route dest {2}, parent 2
line: SDM: [0] sdm_route_get_route dest {2}, parent 2
SDM: [0] sdm_route_get_route dest {2}, parent 2
SDM: dbg_master_cmd_completed src="">
SDM: DbgMasterSetFuncBreakpoint(2:03,0,1,0,,main,,0,0)
SDM: [2] sdm_route_get_route dest {0-1}, parent 2
line: SDM: [1] sdm_route_get_route dest {0}, parent 2
SDM: [1] sdm_route_get_route dest {0}, parent 2
line: SDM: [0] sdm_route_get_route dest {1}, parent 2
SDM: [0] sdm_route_get_route dest {1}, parent 2
line: SDM: [1] sdm_route_get_route dest {2}, parent 2
SDM: [1] sdm_route_get_route dest {2}, parent 2
line: SDM: [0] sdm_route_get_route dest {2}, parent 2
SDM: [0] sdm_route_get_route dest {2}, parent 2
SDM: dbg_master_cmd_completed src="">
SDM: DbgMasterQuit()
SDM: [2] sdm_route_get_route dest {0-1}, parent 2
SDM: shutdown completed
line: SDM: [1] sdm_route_get_route dest {0}, parent 2
SDM: [1] sdm_route_get_route dest {0}, parent 2
line: SDM: [0] sdm_route_get_route dest {1}, parent 2
SDM: [0] sdm_route_get_route dest {1}, parent 2
SDM: DbgMasterFinish
SDM: all finished
waiting: 20129 20130
line: SDM: [0] sdm_route_get_route dest {2}, parent 2
SDM: [0] sdm_route_get_route dest {2}, parent 2
line: SDM: all finished
SDM: all finished
line: SDM: [1] sdm_route_get_route dest {2}, parent 2
SDM: [1] sdm_route_get_route dest {2}, parent 2
line: SDM: all finished
SDM: all finished
line: salloc: Relinquishing job allocation 1151231
salloc: Relinquishing job allocation 1151231
line: salloc: Job allocation 1151231 has been revoked.
salloc: Job allocation 1151231 has been revoked.
exit
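For reference, a guess at what the routing_file passed via --routing_file contains, reconstructed only from the "Found routing file, size=2" and "nodeID/hostname/port" lines in the log above. The exact sdm file format is an assumption:

```
2
0 gcn64 59269
1 gcn65 55920
```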


thanks
Thomas

On Mon, Feb 2, 2015 at 9:23 AM, Thomas Geenen <geenen@xxxxxxxxx> wrote:
hi Greg,

My sort-of-working setup is indeed a merge of parts of the SLURM batch and Open MPI interactive configurations. I use salloc to set up my resources and launch my job with mpirun using Open MPI; this way I can reuse the start_job.pl script from the OPENMPI directory. This results in a command like this:

salloc --time=01:00:00 --partition=gpu_short --ntasks=2 --ntasks-per-node=1 mpirun -np 2 -mca orte_show_resolved_nodenames 1 -display-map /nfs/home1/thomasge/.eclipsesettings/sdm --port=34822 --host=localhost --debugger=gdb-mi --debug=13 --routing_file=/nfs/home1/thomasge/source/testmpitype/routing_file

This does allow me to debug a parallel run using multiple nodes, but there is an issue: when my allocation ends, the termination of the job is not "seen" by Eclipse, and the job remains in an undefined state that I cannot terminate. I basically have to restart Eclipse.

thanks
Thomas


On Fri, Jan 30, 2015 at 3:34 PM, Greg Watson <g.watson@xxxxxxxxxxxx> wrote:
The sdm supports SLURM using the SLURM_PROCID environment variable. The tricky bit is getting a debug job launched. Schedulers like Torque, LSF, etc., provide an interactive mode that is used to launch the job using the appropriate mpirun command for the MPI runtime (e.g. Open MPI or MPICH2), but it tends to be very system specific. For SLURM, you would need to copy the slurm-generic.xml target system configuration and add a submit-interactive-debug command (or submit-batch-debug if no interactive mode is possible). Take a look at the edu.sdsc.trestles.torque.interactive.openmpi.xml configuration for an example.
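For anyone attempting this, a very rough sketch of the kind of element that would go into a copied slurm-generic.xml. The element name comes from Greg's description above; the attributes, argument structure, and `${ptp_rm:...}` variable names are assumptions modeled loosely on the Torque interactive configuration, so treat the actual edu.sdsc.trestles.torque.interactive.openmpi.xml file as the authority:

```xml
<!-- Sketch only: the attribute and variable names below are guesses, not a
     verified PTP JAXB target system configuration. -->
<submit-interactive-debug name="submit-interactive-debug" waitForId="true" keepOpen="true">
  <arg>salloc</arg>
  <arg>--partition=${ptp_rm:queue#value}</arg>
  <arg>--ntasks=${ptp_rm:mpiNumberOfProcesses#value}</arg>
  <arg>mpirun</arg>
  <arg>-np</arg>
  <arg>${ptp_rm:mpiNumberOfProcesses#value}</arg>
  <arg>-display-map</arg>
  <arg>${ptp_rm:debuggerExecutablePath#value}</arg>
  <arg>--host=${ptp_rm:debuggerHost#value}</arg>
  <arg>--port=${ptp_rm:debuggerPort#value}</arg>
  <arg>--debugger=${ptp_rm:debuggerBackend#value}</arg>
</submit-interactive-debug>
```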

Greg


On Jan 30, 2015, at 8:50 AM, Thomas Geenen <geenen@xxxxxxxxx> wrote:

hi Beth,

Could you give me an update on the SLURM support for debugging?
I have hacked something together for my local setup that sort of works, but I am not really happy with it.

best
Thomas

On Mon, Aug 11, 2014 at 8:49 PM, Beth Tibbitts <beth@xxxxxxxxxx> wrote:
Well, you seem enthusiastic; assuming you can do some development with guidance, have access to a SLURM system, and can test on SLURM... that's a great start.
We will discuss with other PTP developers at this week's PTP hackathon.

> how to go about testing/building a developer version.
It may seem intimidating, I know, but we have instructions for that and we can help:
https://wiki.eclipse.org/PTP under Developer Resources > Environment Setup.

I would recommend keeping a separate Eclipse installation and workspace for PTP plugin work, distinct from your normal PTP user install and workspace.


...Beth

Beth Tibbitts


On Mon, Aug 11, 2014 at 2:43 PM, Biddiscombe, John A. <biddisco@xxxxxxx> wrote:
Beth

If there is any way I can help, then I happily volunteer. Unfortunately, I have never found the time to look into the eclipse/ptp plugin source, so I have absolutely no idea how to go about testing or building a developer version. Many times I vowed to myself that I’d try to get it working, but never did.

If you want a volunteer, then I’m in. But unless you can make use of a clueless loser and tell me what to do, then I probably won’t be any help.

JB

From: Beth Tibbitts <beth@xxxxxxxxxx>
Reply-To: "ptp-user@xxxxxxxxxxx" <ptp-user@xxxxxxxxxxx>
Date: Monday 11 August 2014 17:44
To: "ptp-user@xxxxxxxxxxx" <ptp-user@xxxxxxxxxxx>
Subject: [ptp-user] PTP debugging with SLURM (was mpich support in luna?)

John,
we lost our committer who did the support and testing for SLURM.
If you are willing to work on it and test it, I'm sure other committers can give you some direction.
We are having a PTP hackathon this week in Baton Rouge, would be a good topic for discussion.


...Beth

Beth Tibbitts
beth@xxxxxxxxxx


On Mon, Aug 11, 2014 at 9:53 AM, Biddiscombe, John A. <biddisco@xxxxxxx> wrote:
Dear list

I had a look at http://wiki.eclipse.org/images/8/80/PTP-user-debug-20140129.pdf, which gives info on MPICH debugging in PTP Luna.

Question: would this work with SLURM? I’ve not had success with debugging using SLURM previously, so I wonder if it is supported now.

Thanks

JB

_______________________________________________
ptp-user mailing list
ptp-user@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/ptp-user


