[ptp-dev] Possible issues while running the "dsh" use case in SCI source distribution

Hello,

This is Rui ("Ray") Liu.  I work with Jay Alameda and Rick Kufrin at NCSA.  I recently started trying out SCI and attended the October 4 conference call.  While running the "dsh" use case that comes with the SCI source code, I observed some possible issues, which I am reporting here.

On my desktop ("pureland", an i386 Ubuntu 8.04 Linux machine), I started the SCI daemon and tried the distributed shell ("dsh") use case in the SCI source code that I got from CVS.  I did not find documentation on how to set up a "host.list" file, so by trial and error I came up with one that seemed to work: two lines, each containing "127.0.0.1", for 2 BEs on the local host.  However, I don't know whether this is valid.

I observed that, most of the time, the use case worked as expected.  After starting, I could issue a command at the FE, such as "echo hello" or "ls", and the command would be executed at the BEs and the standard output brought back to the FE to display.  After being issued the "quit" command, the FE and the BEs all exited cleanly.

However, sometimes (about 1/3 of the time), there were some issues:

1) Possible deadlock:

Sometimes some threads in a BE would hang forever instead of exiting.  I've attached a sample session from last Friday, which shows one BE with 2 hung threads.  Using "gdb" to locate the threads, I found that the main thread was waiting in the pthread_join() in SCI_Terminate(), and the writer processor thread was stuck in MessageQueue::sem_wait_i().  It might be that another thread was supposed to post the semaphore so the writer processor could proceed to exit, but that thread exited without doing so.
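
To illustrate the pattern I suspect, here is a minimal Python sketch.  It is not SCI code: the function and thread names are hypothetical stand-ins, and it only assumes that the writer blocks on a semaphore which some other thread is expected to post before exiting.  A bounded join is used so that the sketch itself does not hang.

```python
import threading

def run_writer(poster_exits_early, timeout=0.5):
    """Return True if the (hypothetical) writer thread finished, False if it hung."""
    sem = threading.Semaphore(0)      # stands in for the message queue semaphore
    finished = threading.Event()

    def writer():
        # Stand-in for the writer processor thread: blocks until someone posts.
        sem.acquire()
        finished.set()

    def poster():
        # Stand-in for the thread that is supposed to post the semaphore.
        if poster_exits_early:
            return                    # exits without posting: writer stays blocked
        sem.release()

    w = threading.Thread(target=writer, daemon=True)  # daemon: process can still exit
    p = threading.Thread(target=poster)
    w.start()
    p.start()
    p.join()
    # pthread_join() waits without a bound; the timeout here only keeps the demo finite.
    w.join(timeout)
    return finished.is_set()

if __name__ == "__main__":
    print("normal shutdown, writer finished:", run_writer(False))
    print("poster exited early, writer finished:", run_writer(True))
```

If the poster returns without releasing the semaphore, the writer never wakes up, and an unbounded join on it would wait forever, which matches what I saw in gdb.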

2) Occasional hang at the FE and crash of scidv1:

In a test today, after running "./use_ext_launcher" and giving "ls" as the command, the result from client ID 0 came back, but then the FE hung without printing client ID 1's output or the ">>>" prompt.  After I manually stopped it with "CTRL-C", I found that the scidv1 daemon had crashed as well: port 6688 was no longer open, and all the existing connections on port 6688 were in the "TIME_WAIT" state.  I had to restart the daemon for later tests.  When I repeated the test, the same sequence went fine without any issue.
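
For anyone trying to reproduce this, a quick way to check whether the daemon is still listening is to attempt a TCP connection to its port.  The sketch below is a generic check, not part of SCI; the helper name is hypothetical, and the host and port (127.0.0.1, 6688) are simply the values from my setup.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        # create_connection raises OSError (e.g. connection refused) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 6688 is the port the scidv1 daemon was using on my machine.
    print("daemon port open:", port_is_open("127.0.0.1", 6688))
```

A False result after the FE hangs would suggest that the daemon has gone away rather than just the FE being stuck.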

I'm unsure whether you have seen these before, or whether they are due to errors on my part in using SCI, since I'm still new to it. :-)  If it is a usage error on my part, please accept my apologies.

Thanks,
Rui
