[ptp-dev] Possible issues while running the "dsh" use case in SCI source distribution

Sorry, I forgot to attach the file "deadlock.txt"...

-------- Original Message --------
Subject: Possible issues while running the "dsh" use case in SCI source distribution
Date: Mon, 01 Nov 2010 16:35:01 -0500
From: Rui Liu <ruiliu@xxxxxxxxxxxxxxxxx>
Organization: NCSA
To: ptp-dev@xxxxxxxxxxx
CC: Jay Alameda <jalameda@xxxxxxxxxxxxxxxxx>,  Rick Kufrin <rkufrin@xxxxxxxxxxxx>

Hello,

This is Rui ("Ray") Liu.  I work with Jay Alameda and Rick Kufrin at NCSA.  I recently started trying out SCI and attended the October 4 conference call.  I observed some possible issues while running the "dsh" use case that comes with the SCI source code, and am reporting them to you here.

On my desktop ("pureland", an i386 Ubuntu 8.04 Linux machine), I started the SCI daemon and tried the distributed shell ("dsh") use case in the SCI source code that I got from CVS.  I did not find documentation on how to set up a "host.list" file, so by trial and error I came up with one that seemed to work: I put 2 lines of "127.0.0.1" in the "host.list" file, for 2 BEs on the local host.  However, I don't know whether this is valid.

I observed that, most of the time, the use case worked as expected.  After starting, I could issue a command at the FE, such as "echo hello" or "ls", and the command would be executed at the BEs and the standard output brought back to the FE to display.  After being issued the "quit" command, the FE and the BEs all exited cleanly.

However, sometimes (about 1/3 of the time), there were some issues:

1) Possible deadlock:

Sometimes some threads in a BE would hang forever without exiting.  I've attached a sample session from last Friday, which shows one BE with 2 hung threads.  I used "gdb" to find where the threads were, and found that the main thread was waiting in the pthread_join() call under SCI_Terminate(), while the writer processor thread was stuck in MessageQueue::sem_wait_i().  It might be that another thread was supposed to post the semaphore so that the writer processor could proceed to exit, but that thread exited without doing so.
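
To make the suspected failure mode concrete, here is a minimal sketch of a consumer thread blocked on a semaphore that nobody posts at shutdown. It is a model in Python, not SCI's code: the MessageQueue class, the STOP sentinel, and the run() helper are all hypothetical names, and the sentinel is only one common remedy, not necessarily the right fix for SCI.

```python
import threading
from collections import deque


class MessageQueue:
    """Toy queue guarded by a counting semaphore (hypothetical, not SCI's class)."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()
        self._sem = threading.Semaphore(0)

    def produce(self, msg):
        with self._lock:
            self._items.append(msg)
        self._sem.release()  # plays the role of sem_post()

    def consume(self):
        self._sem.acquire()  # plays the role of sem_wait() with no timeout
        with self._lock:
            return self._items.popleft()


STOP = object()  # assumed shutdown sentinel, for illustration only


def writer(queue, log):
    """Consumer loop: blocks in consume() until a message or STOP arrives."""
    while True:
        msg = queue.consume()
        if msg is STOP:
            return
        log.append(msg)


def run(send_stop, join_timeout=0.5):
    """Start a writer, send one message, optionally send STOP, then join.

    Returns (writer_exited, log). The join uses a timeout so that the
    demonstration itself cannot hang the way the BE did.
    """
    queue, log = MessageQueue(), []
    thread = threading.Thread(target=writer, args=(queue, log), daemon=True)
    thread.start()
    queue.produce("hello")
    if send_stop:
        queue.produce(STOP)  # the post that lets the writer leave its wait
    thread.join(join_timeout)
    return (not thread.is_alive()), log
```

Calling run(send_stop=False) leaves the writer blocked in consume() after it has handled "hello", which corresponds to the main thread sitting in pthread_join() in the attached backtrace. Calling run(send_stop=True) lets the writer return and the join complete.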

2) Occasional hang at the FE and crash of scidv1:

In a test today, after running "./use_ext_launcher" and giving "ls" as the command, the result from client ID 0 came back, but then the FE hung without printing client ID 1's output or the ">>>" prompt.  After I manually stopped it with "CTRL-C", I found that the scidv1 daemon had crashed as well: port 6688 was no longer open, and all the existing connections on port 6688 were in the "TIME_WAIT" state.  I had to restart the daemon for later tests.  When I repeated the test, the same sequence ran without any issue.
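
The port check described above can also be scripted, so that a test run can confirm the daemon is still accepting connections before and after each session. This is only a sketch: is_listening() is a hypothetical helper, and the host and port in the usage lines are the ones from this report (6688 on the local machine).

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or otherwise unreachable.
        return False


if __name__ == "__main__":
    # Port 6688 is the one observed in the netstat output of this report.
    state = "up" if is_listening("localhost", 6688) else "down"
    print("daemon port 6688 is", state)
```

A successful connect only shows that something is listening on the port; it does not prove that the listener is scidv1 or that it is healthy.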

I'm unsure whether you have seen these before, or whether they are due to errors on my part in using SCI, since I'm still new to it. :-)  If it's a usage error on my part, please accept my apologies.

Thanks,
Rui

pureland 10/29 4:47pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % netstat -an | grep 6688

pureland 10/29 4:47pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % sudo /opt/sci/sbin/scidv1 
[sudo] password for ruiliu: 

pureland 10/29 4:47pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % netstat -an | grep 6688
tcp6       0      0 :::6688                 :::*                    LISTEN     

pureland 10/29 4:47pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % more host.list
127.0.0.1
127.0.0.1

pureland 10/29 4:47pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % more setup-env
setenv LD_LIBRARY_PATH /opt/sci/lib
setenv SCI_BACKEND_NUM 2
setenv PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/ruiliu/bin:/home/ruiliu/bin:/opt/sci/bin
setenv SCI_LOG_DIRECTORY ./log
setenv SCI_LOG_LEVEL 6

pureland 10/29 4:47pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % source setup-env

pureland 10/29 4:47pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % ./use_ext_launcher
Start back ends ...
export SCI_JOB_KEY=492407142; export SCI_USE_EXTLAUNCHER=yes; export SCI_CLIENT_ID=0; ./dsh_be
export SCI_JOB_KEY=492407142; export SCI_USE_EXTLAUNCHER=yes; export SCI_CLIENT_ID=1; ./dsh_be
Wait 5 seconds ...
Start front end ...
export SCI_JOB_KEY=492407142; export SCI_USE_EXTLAUNCHER=yes; ./dsh_fe
>>> echo $PWD
1: /home/ruiliu/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh
0: /home/ruiliu/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh
>>> echo $SCI_CLIENT_ID
1: 1
0: 0
>>> quit

pureland 10/29 4:48pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % ps ux | grep dsh
ruiliu    6483  0.0  0.1  28428  1784 pts/0    Sl   16:48   0:00 ./dsh_be
ruiliu    6513  0.0  0.0   3004   752 pts/0    S+   16:48   0:00 grep dsh

pureland 10/29 4:48pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % ps -eLf | head -1
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD

pureland 10/29 4:48pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % ps -eLf | grep dsh
ruiliu    6483     1  6483  0    2 16:48 pts/0    00:00:00 ./dsh_be
ruiliu    6483     1  6500  0    2 16:48 pts/0    00:00:00 ./dsh_be
ruiliu    6515  5758  6515  0    1 16:48 pts/0    00:00:00 grep dsh

pureland 10/29 4:48pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % gdb -p 6483
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
...
This GDB was configured as "i486-linux-gnu".
Attaching to process 6483
Reading symbols from /home/ruiliu/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh/dsh_be...done.
Reading symbols from /lib/tls/i686/cmov/libdl.so.2...done.
Loaded symbols for /lib/tls/i686/cmov/libdl.so.2
...
[New Thread 0xb7bee6c0 (LWP 6483)]
[New Thread 0xb7b9bb90 (LWP 6500)]
Loaded symbols for /lib/tls/i686/cmov/libpthread.so.0
Reading symbols from /opt/sci/lib/libsci.so.0...done.
Loaded symbols for /opt/sci/lib/libsci.so.0
...
Reading symbols from /lib/tls/i686/cmov/libresolv.so.2...done.
Loaded symbols for /lib/tls/i686/cmov/libresolv.so.2
0xb7f16410 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7f16410 in __kernel_vsyscall ()
#1  0xb7eee775 in pthread_join () from /lib/tls/i686/cmov/libpthread.so.0
#2  0xb7ecfa4a in Thread::join (this=0x8051120) at thread.cpp:79
#3  0xb7ea83cb in CtrlBlock::term (this=0x804b010) at ctrlblock.cpp:225
#4  0xb7ea275f in SCI_Terminate () at api.cpp:118
#5  0x0804891f in main (argc=Cannot access memory at address 0x0
) at dsh_be.c:82
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /home/ruiliu/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh/dsh_be, process 6483

pureland 10/29 4:49pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % gdb -p 6500
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
...
Attaching to process 6500

warning: process 6500 is a cloned process
Reading symbols from /home/ruiliu/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh/dsh_be...done.
Reading symbols from /lib/tls/i686/cmov/libdl.so.2...done.
Loaded symbols for /lib/tls/i686/cmov/libdl.so.2
Reading symbols from /lib/tls/i686/cmov/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 0xb7bee6c0 (LWP 6483)]
[New Thread 0xb7b9bb90 (LWP 6500)]
Loaded symbols for /lib/tls/i686/cmov/libpthread.so.0
Reading symbols from /opt/sci/lib/libsci.so.0...done.
Loaded symbols for /opt/sci/lib/libsci.so.0
...
Reading symbols from /lib/tls/i686/cmov/libresolv.so.2...done.
Loaded symbols for /lib/tls/i686/cmov/libresolv.so.2
0xb7f16410 in __kernel_vsyscall ()
(gdb) bt
#0  0xb7f16410 in __kernel_vsyscall ()
#1  0xb7ef3da0 in sem_wait@GLIBC_2.0 () from /lib/tls/i686/cmov/libpthread.so.0
#2  0xb7ea3d3c in MessageQueue::sem_wait_i (this=0x8051290, psem=0x80512d0, usecs=-1000)
    at queue.cpp:229
#3  0xb7ea3ebf in MessageQueue::consume (this=0x8051290, millisecs=-1) at queue.cpp:160
#4  0xb7ecb37b in WriterProcessor::read (this=0x8051120) at writerproc.cpp:56
#5  0xb7ec9ecd in Processor::run (this=0x8051120) at processor.cpp:58
#6  0xb7ecfc78 in init (pthis=0x8051120) at thread.cpp:45
#7  0xb7eed4fb in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#8  0xb7ccfe5e in clone () from /lib/tls/i686/cmov/libc.so.6
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /home/ruiliu/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh/dsh_be, process 6500

pureland 10/29 4:49pm ~/software/org.eclipse.ptp/tools/sci/org.eclipse.ptp.sci/usecase/dsh % 
