Once again, with correct subject line.
On Wednesday, November 07, 2012 12:00:09
Carsten Karbach <c.karbach@xxxxxxxxxxxxx> wrote:
> Dear Christoph,
>
> this sounds like the server part of PTP's monitoring system is unable to
> map the running jobs to the compute nodes. By clicking on a job all
> nodes are grayed out, which do not belong to this job. This might look
> like all nodes are highlighted, since there are no compute nodes mapped
> to any job.
>
> To get more information about reasons for this behavior you can try the
> following:
>
> 1. On the remote machine, go to the ".eclipsesettings" directory,
> located in your home directory
> 2. Create a file called ".LML_da_options" containing a single line
> "keeptmp=1" (no quotes).
> 3. Restart the monitor.
> 4. You should now find a directory called "tmp_<hostname>_<pid>" in the
> ".eclipsesettings" directory. It should contain an error log file, plus
> a bunch of other files. Check these files to see if you can see the
> cause of the error.
> 5. Remember to remove the ".LML_da_options" file once you have finished.
>
> Best regards,
>
> Carsten
Dear Carsten,
It looks as if your guess is correct. LML_da.errlog shows many error messages
like these:
insert_job_into_nodedisplay: Error: could not map node p076-c32
insert_job_into_nodedisplay: Error: could not map node p118-c00
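To see which hosts are affected, the log can be summarized per host. The following is only a minimal Python sketch: the helper name unmapped_hosts is hypothetical, and it assumes that every relevant line has exactly the format of the two messages above and that a name like "p076-c32" stands for host p076, core 32.

```python
import re
from collections import Counter

# Assumed line format, taken from the two messages quoted above.
PATTERN = re.compile(
    r"insert_job_into_nodedisplay: Error: could not map node (\S+)"
)


def unmapped_hosts(lines):
    """Count 'could not map node' messages per host (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            # Assumption: "p076-c32" means host p076, core 32,
            # so everything before the last "-c" is the host name.
            host = match.group(1).rsplit("-c", 1)[0]
            counts[host] += 1
    return counts


sample = [
    "insert_job_into_nodedisplay: Error: could not map node p076-c32",
    "insert_job_into_nodedisplay: Error: could not map node p118-c00",
]
print(unmapped_hosts(sample))
```

For the real log, the lines of LML_da.errlog would be passed in place of the sample list.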
In jobs_LML.xml, I can find jobs in state Running with a list of nodes
specified, for example in the following XML stanza:
<info oid="j000076" type="short">
<data key="queue" value="cluster"/>
<data key="dispatchdate" value="Sat Nov 10 05:29:37 CET 2012"/>
<data key="favored" value="No"/>
<data key="name" value="o3so2000"/>
<data key="step" value="pio01.dkrz.de.2809055.0"/>
<data key="group" value="mh0469"/>
<data key="owner" value="m214074"/>
<data key="queuedate" value="Sat Nov 10 05:29:12 CET 2012"/>
<data key="restart" value="yes"/>
<data key="state" value="Running"/>
<data key="nodelist" value="(p076,32)(p076,33)(p076,34)(p076,35)
(p076,36)(p076,37)(p076,38)(p076,39)(p076,40)(p076,41)(p076,42)(p076,43)
(p076,44)(p076,45)(p076,46)(p076,47)(p076,48)(p076,49)(p076,50)(p076,51)
(p076,52)(p076,53)(p076,54)(p076,55)(p076,56)(p076,57)(p076,58)(p076,59)
(p076,60)(p076,61)(p076,62)(p076,63)(p076,64)(p076,65)(p076,66)(p076,67)
(p076,68)(p076,69)(p076,70)(p076,71)(p076,72)(p076,73)(p076,74)(p076,75)
(p076,76)(p076,77)(p076,78)(p076,79)(p076,80)(p076,81)(p076,82)(p076,83)
(p076,84)(p076,85)(p076,86)(p076,87)(p076,88)(p076,89)(p076,90)(p076,91)
(p076,92)(p076,93)(p076,94)(p076,95)"/>
<data key="wall" value="28800"/>
<data key="wallsoft" value="28800"/>
<data key="classprio" value="50"/>
<data key="groupprio" value="50"/>
<data key="status" value="RUNNING"/>
<data key="totalcores" value="64"/>
<data key="totaltasks" value="64"/>
</info>
In nodes_LML.xml, the nodes are listed under their long names, like this
(taking p076, since it appears in the previous XML stanza):
<info oid="nd000076" type="short">
<data key="ncores" value="64"/>
<data key="availmem" value="38295 mb"/>
<data key="physmem" value="124160 mb"/>
<data key="state" value="Busy"/>
<data key="id" value="p076.dkrz.de"/>
</info>
Could it be that this is a p076 vs. p076.dkrz.de issue?
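One way to test this hypothesis is to compare the host names in a job's nodelist with the node ids, once as they are and once with the domain stripped. The following is a minimal Python sketch, not a description of how LML_da maps nodes internally. The helper names (values, short_name, compare) are hypothetical, and the two fragments are shortened copies of the stanzas above.

```python
import re
import xml.etree.ElementTree as ET


def values(xml_text, key):
    """Collect the value of every <data key=KEY .../> in an XML fragment."""
    root = ET.fromstring(xml_text)
    return [d.get("value") for d in root.iter("data") if d.get("key") == key]


def short_name(host):
    """Strip the domain part: 'p076.dkrz.de' -> 'p076'."""
    return host.split(".", 1)[0]


def compare(job_xml, node_xml):
    """Return (exact matches, matches after stripping the domain)."""
    job_hosts = set()
    for nodelist in values(job_xml, "nodelist"):
        # Entries look like "(p076,32)"; keep only the host part.
        job_hosts.update(re.findall(r"\(\s*([^,()\s]+)\s*,", nodelist))
    node_ids = set(values(node_xml, "id"))
    short_ids = {short_name(n) for n in node_ids}
    return job_hosts & node_ids, job_hosts & short_ids


# Shortened fragments based on the stanzas quoted in this mail.
job = ('<info oid="j000076" type="short">'
       '<data key="nodelist" value="(p076,32)(p076,33)"/></info>')
node = ('<info oid="nd000076" type="short">'
        '<data key="id" value="p076.dkrz.de"/></info>')

exact, by_short = compare(job, node)
print("exact matches:", exact)
print("matches after stripping the domain:", by_short)
```

If the exact set stays empty while the second set is not, that would support the short name vs. long name explanation.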
At any rate, I wrapped the whole directory
/pf/k/k205001/.eclipsesettings/tmp_blizzard2_29360720 into a tar.bz2 file and
placed it on juqueen.fz-juelich.de:/homec/ibm/pospiech
pospiech@juqueen2:~ $ pwd
/homec/ibm/pospiech
pospiech@juqueen2:~ $ ls -l tmp_blizzard2_29360720.tar.bz2
-rw-r--r-- 1 pospiech apache 124368 Nov 10 13:55 tmp_blizzard2_29360720.tar.bz2
Can you please have a look? Thanks!
--
Kind regards
Dr. Christoph Pospiech
High Performance & Parallel Computing
Phone: +49-351 86269826
Mobile: +49-171-765 5871
E-Mail: christoph.pospiech@xxxxxxxxxx