Re: [ptp-user] slurm tasks not honoured?
I think the more general question would be:
With a slurm resource manager running - which seems to be fine - how do I enter specific MVAPICH2 settings to ensure that when the job is submitted to slurm, the correct launch procedure is followed?
The PBS resource manager seems to have all the options for changing the MPI command etc., but I can't find the equivalent using slurm.
(Our system changed from PBS to slurm a few months ago and this is my first attempt to set things up since then.)
(We are using slurm-2.3.0-pre5, by the look of things.)
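For comparison, the launch I'd expect PTP to generate for this kind of job would be roughly the following - a sketch only, using the partition and binary from my setup; the exact flags PTP actually emits are what I'm trying to find out:

```shell
# Hypothetical equivalent of what PTP should submit on my behalf:
# 1 node, 16 tasks, 55 minute limit, stdMem partition.
srun --nodes=1 --ntasks=16 --time=55 --partition=stdMem \
    /project/csvis/biddisco/eiger/build/pv-os/bin/pvserver \
    -rc -ch=148.187.14.220 --use-offscreen-rendering
```

If there is a place to edit this command template by hand, that would probably answer my question.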
thanks (hopefully)
JB
-----Original Message-----
From: ptp-user-bounces@xxxxxxxxxxx [mailto:ptp-user-bounces@xxxxxxxxxxx] On Behalf Of Biddiscombe, John A.
Sent: 20 February 2012 12:29
To: PTP User list
Subject: [ptp-user] slurm tasks not honoured?
Seeing the email about the release of PTP 5.0.5, I updated Eclipse, downloaded the proxy zip file, and recompiled utils, proxy and sdm.
All seems fine, but when I run a job, the number of tasks is always 1, it seems.
Launching with 16 tasks on one node, it outputs this (note the exception every time on job launch):
SLURM@Local: ptp_slurm_proxy: Job step aborted: Waiting up to 2 seconds for job step to finish.
SLURM@Local: Send Job/Process StateChange Event: state=32772
SLURM@Local: job[15974] iothread exit on EOF/ERROR of stdout fd
SLURM@Local: job[15974] iothread exit on Error/EOF of stderr fd.
SLURM@Local: Send Job/Process StateChange Event: state=4
SLURM@Local: Job[15974] no longer exist in SLURM. Romove it!
SLURM@Local: SLURM_SubmitJob (2):
SLURM@Local: job submit commands:
SLURM@Local: jobTimeLimit=55
SLURM@Local: launchedByPTP=true
SLURM@Local: jobNumProcs=16
SLURM@Local: execPath=/project/csvis/biddisco/eiger/build/pv-os/bin
SLURM@Local: progArgs=-rc
SLURM@Local: progArgs=-ch=148.187.14.220
SLURM@Local: progArgs=--use-offscreen-rendering
SLURM@Local: jobNumNodes=1
SLURM@Local: execName=pvserver
SLURM@Local: jobPartition=stdMem
SLURM@Local: jobSubId=JOB_13297370315374
SLURM@Local: Job[15975] io thread create done.
SLURM@Local: Send Job/Process StateChange Event: state=1
java.lang.NullPointerException
at org.eclipse.ptp.ui.views.MachinesNodesView$JobListener.handleEvent(MachinesNodesView.java:111)
at org.eclipse.ptp.rmsystem.AbstractResourceManagerMonitor.fireJobChanged(AbstractResourceManagerMonitor.java:241)
at org.eclipse.ptp.rmsystem.AbstractResourceManager.fireJobChanged(AbstractResourceManager.java:510)
at org.eclipse.ptp.rtsystem.AbstractRuntimeResourceManager.fireJobChanged(AbstractRuntimeResourceManager.java:145)
at org.eclipse.ptp.rtsystem.AbstractRuntimeResourceManagerMonitor.doUpdateJobs(AbstractRuntimeResourceManagerMonitor.java:988)
at org.eclipse.ptp.rtsystem.AbstractRuntimeResourceManagerMonitor.handleEvent(AbstractRuntimeResourceManagerMonitor.java:348)
at org.eclipse.ptp.rtsystem.AbstractRuntimeSystem.fireRuntimeJobChangeEvent(AbstractRuntimeSystem.java:90)
at org.eclipse.ptp.rtsystem.AbstractProxyRuntimeSystem.handleEvent(AbstractProxyRuntimeSystem.java:368)
at org.eclipse.ptp.proxy.runtime.client.AbstractProxyRuntimeClient.fireProxyRuntimeJobChangeEvent(AbstractProxyRuntimeClient.java:249)
at org.eclipse.ptp.proxy.runtime.client.AbstractProxyRuntimeClient.processRunningEvent(AbstractProxyRuntimeClient.java:677)
at org.eclipse.ptp.proxy.runtime.client.AbstractProxyRuntimeClient.runStateMachine(AbstractProxyRuntimeClient.java:937)
at org.eclipse.ptp.proxy.runtime.client.AbstractProxyRuntimeClient$StateMachineThread.run(AbstractProxyRuntimeClient.java:94)
at java.lang.Thread.run(Thread.java:736)
and running scontrol show job <ID> --details gives this:
JobId=15975 Name=pvserver
UserId=biddisco(20569) GroupId=csstaff(1000)
Priority=11025 Account=csstaff QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:01:08 TimeLimit=00:55:00 TimeMin=N/A
SubmitTime=12:23:51 EligibleTime=12:23:51
StartTime=12:23:51 EndTime=12:24:59
PreemptTime=NO_VAL SuspendTime=None SecsPreSuspend=0
Partition=stdMem AllocNode:Sid=eiger220:4509
ReqNodeList=(null) ExcNodeList=(null)
NodeList=eiger200
BatchHost=eiger200
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=eiger200 CPU_IDs=1 Mem=0
MinCPUsNode=1 MinMemoryCPU=12000M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=(null)
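Note NumCPUs=1 above, even though jobNumProcs=16 was sent to the proxy. A way to check whether slurm itself honours the task count, independent of PTP, would be something like this (a sketch; same node/partition parameters as the PTP launch):

```shell
# Sanity check outside PTP: request 16 tasks directly and count
# how many task instances slurm actually starts.
srun --nodes=1 --ntasks=16 --partition=stdMem hostname
# Then, for a queued/running job, inspect the allocation:
#   scontrol show job <jobid> --details   (NumCPUs should be 16)
```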
I suspect the generation of the slurm parameters is fishy. Is it possible to edit them by hand? (I think there was a template somewhere, but I can't remember/find it.)
It's quite possible I'm doing something wrong, as I'm new to this.
Any advice welcome.
thanks
JB
_______________________________________________
ptp-user mailing list
ptp-user@xxxxxxxxxxx
https://dev.eclipse.org/mailman/listinfo/ptp-user