Re: [ptp-dev] Possible issues while running the "dsh" use case in SCI source distribution

Hi Rong,

Thanks a lot for looking into it, and thanks for the welcome message! :-)

I'm glad to report that following your suggestion, I checked out the latest CVS source code, and in testing, I couldn't reproduce the issues either. :-)

Sorry I didn't make a distinction in my previous message between "i386" and "i686".  The machine I used was an "i686", not "i386".  "uname -a" on this machine returned:
   pureland 11/03 10:35am ~ % uname -a
   Linux pureland 2.6.23-perfmon2 #1 SMP Mon Nov 3 11:30:38 CST 2008 i686 GNU/Linux
And as I mentioned, it runs Ubuntu 8.04 Linux.

Thanks for explaining the syntax of the "host.list" file!  So putting 2 lines of "127.0.0.1" is OK, right?

I found that I had checked out the CVS source code on 2010-10-05, and the latest timestamp on the source files was 2010-08-04.  Following your suggestion, I made a clean CVS checkout today and tested again.  To my pleasant surprise, I could not reproduce the issues in more than 10 attempts.  So the issues seem to be gone.

A closer look revealed that there were substantial changes in the source code: more than 55 ".hpp" or ".cpp" files were changed or checked in in the meantime, one batch having a timestamp of 2010-10-08, the other batch having a timestamp of 2010-10-26.  Likely some of these changes resolved the issues.  Anyway, I'm glad that I don't see the issues any more.  SCI has the potential of being used by other software, so it would be good for them to have a solid base. :-)

Thanks again for looking into it and for your prompt response!

Thanks,
Rui


On 11/03/2010 05:10 AM, Rong lI Li wrote:

Hi, Rui Liu,

Thanks for reporting these issues to us, and welcome to SCI!

First, for the "host.list" file, you can put either a hostname (such as "hostname1") or an IP address (such as "9.123.100.40") into the file. Both hostnames and IP addresses are supported.
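
As a minimal sketch of the file contents described above: the helper below builds a "host.list" body from a list of hostnames or IP addresses. It assumes one entry per line, which matches the two-line "127.0.0.1" file in the quoted report but is not spelled out in this thread. The function name format_host_list is hypothetical and not part of SCI.

```python
def format_host_list(hosts):
    """Return host.list contents, assuming one hostname or IP address per line."""
    # Drop surrounding whitespace and skip empty entries.
    entries = [h.strip() for h in hosts if h.strip()]
    return "".join(entry + "\n" for entry in entries)


# Two back ends on the local machine, as in the quoted report.
print(format_host_list(["127.0.0.1", "127.0.0.1"]), end="")
```

The returned string can be written to a file named "host.list" in the use case directory.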

Then I tried to reproduce the issues on both our amd_64 nodes and my VMware image (SUSE Linux), as we do not have i386 nodes locally. But I could not reproduce them. Both "dsh" and "./use_ext_launcher" ran successfully.

[ronglli@pexserv03:~/SCI/SCI_CVS/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh]$ ./dsh_fe
 >>> ls
0: CVS
0: Makefile
0: Makefile.aix
.........
0: use_ext_launcher
0: use_ext_launcher2
0: use_ext_launcher_hl
1: CVS
1: Makefile
1: Makefile.aix
............
1: use_ext_launcher
1: use_ext_launcher2
1: use_ext_launcher_hl
 >>> quit

[ronglli@pexserv03:~/SCI/SCI_CVS/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh]$ ./use_ext_launcher
Start back ends ...
export SCI_JOB_KEY=17400; export SCI_USE_EXTLAUNCHER=yes; export SCI_CLIENT_ID=0; ./dsh_be
export SCI_JOB_KEY=17400; export SCI_USE_EXTLAUNCHER=yes; export SCI_CLIENT_ID=1; ./dsh_be
Wait 5 seconds ...
Start front end ...
export SCI_JOB_KEY=17400; export SCI_USE_EXTLAUNCHER=yes; ./dsh_fe
 >>> ls
0: CVS
0: Makefile
0: Makefile.aix
.......
0: use_ext_launcher
0: use_ext_launcher2
0: use_ext_launcher_hl
1: CVS
1: Makefile
1: Makefile.aix
..............
1: use_ext_launcher
1: use_ext_launcher2
1: use_ext_launcher_hl
 >>> quit

Could you please check which version you are using (when did you check out the SCI files from CVS?) and provide more information? Thanks a lot! If you are not using the latest version, you can run "export CVSROOT=:pserver:anonymous@xxxxxxxxxxxxxxx/cvsroot/tools; cvs checkout org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci" to download the newest copy and try again. You can also try running it on other platforms, such as "i686" nodes, to see whether the issue always occurs.
I will keep looking for an i386 node to try it on.

Thanks again, and please feel free to let us know if you have any other issues.





 > From: Rui Liu <ruiliu@xxxxxxxxxxxxxxxxx>
 > Date: November 1, 2010 5:35:01 PM EDT
 > To: ptp-dev@xxxxxxxxxxx
 > Cc: Rick Kufrin <rkufrin@xxxxxxxxxxxx>, Jay Alameda <jalameda@xxxxxxxxxxxxxxxxx>
 > Subject: [ptp-dev] Possible issues while running the "dsh" use case in SCI source distribution
 > Reply-To: Parallel Tools Platform general developers <ptp-dev@xxxxxxxxxxx>
 >
 > Hello,
 >
> This is Rui ("Ray") Liu. I work with Jay Alameda and Rick Kufrin at NCSA. Recently I started working on trying SCI and attended October 4's conference call. I observed some possible issues while running the "dsh" use case that came with the SCI source code, and am now reporting them to you.
 >
> On my desktop ("pureland", an i386 Ubuntu 8.04 Linux machine), I started the SCI daemon, and tried the distributed shell ("dsh") use case in the SCI source code that I got from CVS. I did not see documentation on how to set up a "host.list" file, so by trial and error, I came up with one that seemed to work: I put 2 lines of "127.0.0.1" in the "host.list" file for 2 BEs on the local host. However, I don't know whether this is valid or not.
 >
> I observed that, most of the time, the use case worked as expected. After starting, I could issue a command at the FE, such as "echo hello" or "ls", and the command would be executed at the BEs and the standard output brought back to the FE to display. After being issued the "quit" command, the FE and the BEs all exited cleanly.
 >
 > However, sometimes (about 1/3 of the time), there were some issues:
 >
 > 1) Possible deadlock:
 >
> Sometimes some threads in a BE would hang forever, without exiting. I've attached a sample session that occurred last Friday, which showed one BE having 2 threads hanging. I used "gdb" to find where the threads were, and found that the main thread was waiting in the pthread_join() in SCI_Terminate(), and the writer processor thread got stuck in MessageQueue::sem_wait_i(). It might be that another thread was supposed to post the semaphore to allow the writer processor to proceed to exit, but that thread exited without doing so.
 >
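
The failure mode suggested above (a thread blocked on a semaphore that nobody posts before exiting) can be illustrated with a small toy model. This is only a sketch in Python, not SCI's C++ code: the class and function names are hypothetical, and the actual cause of the hang was not confirmed in this thread.

```python
import threading
from collections import deque


class MessageQueue:
    """Toy queue guarded by a counting semaphore (not SCI's implementation)."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()
        self._available = threading.Semaphore(0)

    def produce(self, item):
        with self._lock:
            self._items.append(item)
        self._available.release()  # post: wake one waiting consumer

    def consume(self):
        self._available.acquire()  # wait: blocks until a matching post
        with self._lock:
            return self._items.popleft()


STOP = object()  # sentinel meaning "no more messages"


def writer(queue, out):
    """Consume messages until the STOP sentinel arrives."""
    while True:
        msg = queue.consume()
        if msg is STOP:
            break
        out.append(msg)


def run(post_stop_on_exit):
    """Return (messages written, whether the writer thread is still blocked)."""
    queue, out = MessageQueue(), []
    t = threading.Thread(target=writer, args=(queue, out), daemon=True)
    t.start()
    queue.produce("hello")
    if post_stop_on_exit:
        queue.produce(STOP)  # final post that lets the writer exit
    t.join(timeout=1.0)  # bounded join so this demo cannot hang itself
    return out, t.is_alive()
```

Calling run(True) lets the writer thread finish, while run(False) leaves it blocked in consume(), which is the analogue of an unbounded pthread_join() never returning.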
 > 2) Occasional hanging at the FE and crash of scidv1.
 >
> In a test today, after running "./use_ext_launcher" and giving "ls" as the command, the result from client ID 0 came back, but then the FE hung without printing client ID 1's output or the ">>>" prompt. After I manually stopped it with "CTRL-C", I found that the scidv1 daemon had crashed as well, since port 6688 was no longer open, and all the existing connections at port 6688 were in "TIME_WAIT" status. I had to restart the daemon for later tests. In a repeated test, the same sequence went fine without any issue.
 >
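
One way to check whether a daemon is still listening, as described above, is to attempt a TCP connection to its port. The sketch below is a generic check, not an SCI tool; the function name port_is_open is hypothetical, and port 6688 is taken from the report.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

For example, port_is_open("127.0.0.1", 6688) would return False once nothing is listening on that port on the local machine.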
> I'm unsure whether you have seen these before, or whether they are due to errors on my part in using SCI, since I'm still new to SCI. :-) If it's a usage error on my part, please accept my apology.
 >
 > Thanks,
 > Rui
 > _______________________________________________
 > ptp-dev mailing list
 > ptp-dev@xxxxxxxxxxx
 > https://dev.eclipse.org/mailman/listinfo/ptp-dev

=====================

Rong "Jessica", Li (李荣)
IBM Systems & Technology Group, Development
SOFTWARE ENGINEER
Tel:86-10-82451010  Email:ronglli@xxxxxxxxxx
Address: 1F, Building 28, Zhong Guan Cun Software Park, No.8, Dong Bei Wang West Road. Hai Dian District, Beijing 100193, PRC




