Re: [ptp-dev] Orte jobs do not stop

I've got a little more information on this issue now.  It occurs when
OpenMPI has been configured with:

--enable-mpi-threads
--with-devel-headers
--enable-orterun-prefix-by-default
--with-tm=/opt/torque

This appears to hold for any version of OpenMPI (tested with 1.2.3 and
1.2.6).  It does not occur when --enable-mpi-threads and
--enable-orterun-prefix-by-default are omitted.
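
For anyone trying to reproduce this, the two builds can be sketched as the configure lines below. This is only an illustration: it assumes the working build kept --with-devel-headers and --with-tm=/opt/torque, and the variable names are made up for the example. The script just prints the two command lines; it does not run configure.

```shell
#!/bin/sh
# Flags of the build where the orte process hangs (taken from the list above).
FAILING_FLAGS="--enable-mpi-threads --with-devel-headers --enable-orterun-prefix-by-default --with-tm=/opt/torque"

# The two flags whose removal makes the hang go away.
SUSPECT_FLAGS="--enable-mpi-threads --enable-orterun-prefix-by-default"

# Derive the working flag set by dropping the suspect flags.
# Assumption: the remaining flags were left unchanged in the working build.
WORKING_FLAGS=""
for flag in $FAILING_FLAGS; do
    case " $SUSPECT_FLAGS " in
        *" $flag "*) ;;  # suspect flag: leave it out
        *) WORKING_FLAGS="${WORKING_FLAGS:+$WORKING_FLAGS }$flag" ;;
    esac
done

# Print the two invocations side by side for comparison.
echo "hangs: ./configure $FAILING_FLAGS"
echo "works: ./configure $WORKING_FLAGS"
```

Re-adding the two suspect flags one at a time should show whether one of them alone is enough to trigger the problem.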

I guess the focus now is on getting the new launch manager together,
but if any of this suggests a work-around for the current release,
please let me know.

-Wyatt

On Thu, Aug 7, 2008 at 5:13 PM, wspear <wspear@xxxxxxxxxxxxxx> wrote:
> Greetings Greg,
>
> Have you had a chance to look at this yet?
>
> Thanks,
> Wyatt
>
> On Thu, Jul 10, 2008 at 2:32 AM, Greg Watson <g.watson@xxxxxxxxxxxx> wrote:
>> Wyatt,
>>
>> I haven't tried PTP 2.0 with Open MPI 1.2.6 (only 1.2.5) so it's possible
>> that something has broken. I'll install it on my Linux VM and let you know
>> how it goes.
>>
>> Greg
>>
>> On Jul 9, 2008, at 6:08 PM, wspear wrote:
>>
>>> This is openmpi 1.2.6 built with gnu 4.1.2.  It's running on x86_64
>>> Linux.  I have been using the PTP 2.0 available from the update site
>>> (2.0.0.200806061515).  The behavior is the same in both Europa and
>>> Ganymede.
>>>
>>> -Wyatt
>>>
>>> On Wed, Jul 9, 2008 at 5:35 AM, Greg Watson <g.watson@xxxxxxxxxxxx> wrote:
>>>>
>>>> Wyatt,
>>>>
>>>> What version of Open MPI are you using? What type of system is it?
>>>> Is this PTP 2.0 or from CVS?
>>>>
>>>> PTP 2.0 has not been tested with Ganymede, but it sounds like this is
>>>> a problem with Open MPI. Can you try with Europa to see if you have
>>>> the same problem?
>>>>
>>>> Thanks,
>>>>
>>>> Greg
>>>>
>>>> On Jul 8, 2008, at 11:36 PM, wspear wrote:
>>>>
>>>>> Greetings,
>>>>>
>>>>> When I try to execute an MPI application with PTP via the orte, it
>>>>> seems to run successfully.  But after what should be the final output
>>>>> is printed, PTP continues to list the job status as running, and
>>>>> the orte process's processor usage shoots up to 100% in top.  If I try
>>>>> to stop the job or shut down the orte resource manager manually,
>>>>> Eclipse freezes solid and I need to kill the orte process from the
>>>>> command line.
>>>>>
>>>>> Three possibly relevant factors are that I'm using a version of
>>>>> openmpi configured for use with pbs (though I'm just running on the
>>>>> head node at the moment), I'm running these tests in the Ganymede
>>>>> Eclipse release, and I get a warning about oversubscribed nodes (which
>>>>> is also normal for running with mpirun on the headnode in this case).
>>>>>
>>>>> I don't know if any of those could explain why the application would
>>>>> run successfully while the orte fails to stop, though.
>>>>>
>>>>> When I run it on a back-end node, where interactive jobs are allowed,
>>>>> the execution completes without the warning, but the output only shows
>>>>> up on the command line where Eclipse was launched, and there is no
>>>>> sign that the start of the process or individual jobs were detected or
>>>>> handled by the PTP.  The orte process still freezes as described
>>>>> above.
>>>>>
>>>>> Any ideas how I might fix this?  Has anyone been working on a PBS
>>>>> resource manager for PTP?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Wyatt
>>>>> _______________________________________________
>>>>> ptp-dev mailing list
>>>>> ptp-dev@xxxxxxxxxxx
>>>>> https://dev.eclipse.org/mailman/listinfo/ptp-dev
>>>>>