Brent N. Chun ~ California Institute of Technology ~ CACR

Overview

GEXEC is a scalable cluster remote execution system that provides fast, RSA-authenticated remote execution of parallel and distributed jobs. It transparently forwards stdin, stdout, stderr, and signals to and from remote processes, propagates the local environment, and is designed to be robust and to scale to systems of more than 1000 nodes. Internally, GEXEC builds an n-ary tree of TCP sockets and threads between gexec daemons and propagates control information up and down the tree. This hierarchical control distributes the work and resource usage of massive parallelism across multiple nodes, which avoids single-node resource limits (e.g., limits on the number of file descriptors on front-end nodes). The initial release of the software (below) consists of a daemon, a client program, and a library that provides a programmatic interface to the GEXEC system.
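
To make the tree idea concrete, here is a minimal sketch of how nodes in an n-ary tree can be numbered so that each one knows its parent and children. The heap-style numbering, the function names tree_parent and tree_children, and the fan-out of 3 are assumptions chosen for illustration; they are not taken from the GEXEC source.

```sh
# Illustrative only: heap-style numbering of an n-ary tree. This layout is an
# assumption for the sketch, not GEXEC's documented internals. Node 0 is the root.

# Print the parent of node $1 in a tree with fan-out $2 (the root prints -1).
tree_parent() {
    if [ "$1" -eq 0 ]; then
        echo -1
    else
        echo $(( ($1 - 1) / $2 ))
    fi
}

# Print the children of node $1, given fan-out $2 and $3 nodes in total.
tree_children() {
    child=$(( $1 * $2 + 1 ))
    last=$(( child + $2 ))
    out=""
    while [ "$child" -lt "$last" ] && [ "$child" -lt "$3" ]; do
        out="${out:+$out }$child"
        child=$(( child + 1 ))
    done
    echo "$out"
}

# Example: 13 daemons with fan-out 3.
for node in 0 1 3; do
    echo "node $node: parent=$(tree_parent "$node" 3) children=[$(tree_children "$node" 3 13)]"
done
```

In this sketch no node ever talks to more than one parent and three children, however large the cluster grows, which is the property that keeps per-node socket and file descriptor usage bounded.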

Software

Update: GEXEC source code and releases are now maintained as part of the Ganglia project. The GEXEC source can be checked out via svn at: http://sourceforge.net/svn/?group_id=43021. The source code can be browsed directly at: http://ganglia.svn.sf.net/viewvc/ganglia/trunk/gexec/gexec. See the Ganglia SourceForge page for more.

Version  Release Date  Source              RPM(s)                                              SRPM
v0.3.7   06.04.2008    gexec-0.3.7.tar.gz  gexec-0.3.7-1.i386.rpm                              gexec-0.3.7-1.src.rpm
v0.3.6   09.27.2004    gexec-0.3.6.tar.gz  gexec-0.3.6-1.i386.rh9.rpm, gexec-0.3.6-1.i386.rpm  gexec-0.3.6-1.src.rpm
v0.3.5   08.07.2002    gexec-0.3.5.tar.gz  gexec-0.3.5-1.i386.rpm                              gexec-0.3.5-1.src.rpm
v0.3.4   04.29.2002    gexec-0.3.4.tar.gz  gexec-0.3.4-1.i386.rpm, gexec-0.3.4-1.ia64.rpm      gexec-0.3.4-1.src.rpm
v0.3.3   04.23.2002    gexec-0.3.3.tar.gz  gexec-0.3.3-1.i386.rpm                              gexec-0.3.3-1.src.rpm
v0.3.2   04.20.2002    gexec-0.3.2.tar.gz  gexec-0.3.2-1.i386.rpm
v0.3.1   04.18.2002    gexec-0.3.1.tar.gz  gexec-0.3.1-1.i386.rpm
v0.3.0   03.22.2002    gexec-0.3.0.tar.gz  gexec-0.3.0-1.i386.rpm
v0.2.1   03.22.2002    gexec-0.2.1.tar.gz  gexec-0.2.1-1.i386.rpm
v0.2.0   03.19.2002    gexec-0.2.0.tar.gz  gexec-0.2.0-1.i386.rpm
v0.1     03.11.2002    gexec-0.1.tar.gz    gexec-0.1-1.i386.rpm
ChangeLog

Documentation

  1. Install authd on all nodes in the cluster. See the authd web page for installation instructions.

  2. Add the following line to /etc/services on all nodes in the cluster (not necessary on RedHat 7.3, but needed on certain distributions such as RedHat 9.0):
        gexec   2875/tcp    # Caltech GEXEC
    

  3. Install GEXEC on all nodes in the cluster (e.g., cluster nodes tgl0, tgl1, ..)

    tgl0# rpm -ivh gexec-0.3.7-1.i386.rpm
    tgl1# rpm -ivh gexec-0.3.7-1.i386.rpm
    tgl2# ...

  4. Run the client program gexec. Note that on newer Linux kernels (e.g., the 2.4.x RedHat 9 kernel), you'll need to set the LD_ASSUME_KERNEL environment variable to "2.2.5" to avoid LinuxThreads bugs (e.g., incomplete implementation of POSIX cancellation points).

    # export LD_ASSUME_KERNEL="2.2.5"
    # export GEXEC_SVRS="tgl0 tgl1 tgl2 tgl3"
    # gexec -n 4 hostname
    1 tgl1
    3 tgl3
    0 tgl0
    2 tgl2
    
The RPM install and uninstall procedures handle both installing or removing the software and starting or stopping the daemons. Because GEXEC runs via xinetd, starting or stopping the daemons simply means adding or removing a file in /etc/xinetd.d and sending SIGUSR2 to xinetd so that it rereads its configuration files.
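
The /etc/services edit from step 2 is easy to get wrong when repeated across many nodes, so it may help to make it idempotent. The helper below is a sketch, not part of GEXEC: the function name ensure_gexec_service is hypothetical, and it only checks for the exact port and protocol shown above.

```sh
# Append the GEXEC entry to a services file unless one is already present.
# $1 is the file to edit; it defaults to /etc/services.
ensure_gexec_service() {
    services_file="${1:-/etc/services}"
    # Match a line whose first field is "gexec" and whose second starts with 2875/tcp.
    if grep -Eq '^gexec[[:space:]]+2875/tcp' "$services_file" 2>/dev/null; then
        echo "gexec entry already present in $services_file"
    else
        printf 'gexec\t2875/tcp\t# Caltech GEXEC\n' >> "$services_file"
        echo "gexec entry added to $services_file"
    fi
}
```

Defining the function changes nothing on its own. Calling ensure_gexec_service with no argument as root on each node would apply the change to /etc/services, and running it a second time leaves the file untouched.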

GEXEC can be used interactively using the gexec client or programmatically using the GEXEC library, libgexec.a. With the client, node selection can be done in one of two ways. It can be done by explicitly naming a set of nodes using the GEXEC_SVRS environment variable:

# export LD_ASSUME_KERNEL="2.2.5"
# export GEXEC_SVRS="tgl0 tgl1 tgl2 tgl3"
# gexec -n 4 hostname
1 tgl1
3 tgl3
0 tgl0
2 tgl2
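
Output lines from different nodes arrive in no particular order, as the run above shows. Each line starts with a number that appears to identify the remote process; that interpretation is an assumption, since it is not spelled out here. If a stable ordering is needed, the lines can be sorted on that leading number. The helper name sort_by_index is hypothetical:

```sh
# Sort "index hostname" lines numerically by their leading index.
# Reads stdin, writes stdout; the meaning of the index is assumed here.
sort_by_index() {
    sort -n -k1,1
}

# Sample input copied from the run above; in practice this would be
#   gexec -n 4 hostname | sort_by_index
printf '1 tgl1\n3 tgl3\n0 tgl0\n2 tgl2\n' | sort_by_index
```

Sorting waits for all output before printing anything, so this suits short batch commands rather than long-running interactive jobs.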

Alternatively, node selection can be done using Ganglia by specifying one or more potential gmond servers to query. The first gmond server that is both up and returns a non-empty set of nodes is used to provide the list of nodes ('-n 0' means all nodes; the example below runs on a five-node cluster):

# export LD_ASSUME_KERNEL="2.2.5"
# export GEXEC_GMOND_SVRS="tgl1 tgl3"
# gexec -n 0 hostname
1 tgl1
4 tgl4
3 tgl3
0 tgl0
2 tgl2
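
The "first server that is up and returns a non-empty set" rule can be sketched as a simple loop. This is an illustration of the selection rule only, not GEXEC's code: query_nodes is a hypothetical stand-in that fakes gmond responses so the example runs anywhere, and first_usable_nodes is likewise an invented name.

```sh
# Hypothetical stand-in for "ask this gmond server for its node list".
# It prints one hostname per line, prints nothing for an empty cluster, and
# returns non-zero if the server is down. Replace with a real query.
query_nodes() {
    case "$1" in
        tgl1) return 1 ;;                                  # pretend tgl1 is down
        tgl3) printf 'tgl0\ntgl1\ntgl2\ntgl3\ntgl4\n' ;;   # up, five nodes
        *)    return 0 ;;                                  # up, but reports no nodes
    esac
}

# Print the node list from the first server in "$@" that is up and non-empty.
# Returns 1 if no candidate qualifies.
first_usable_nodes() {
    for server in "$@"; do
        if nodes=$(query_nodes "$server") && [ -n "$nodes" ]; then
            echo "$nodes"
            return 0
        fi
    done
    return 1
}

# Candidates as they would appear in GEXEC_GMOND_SVRS.
first_usable_nodes tgl1 tgl3
```

With the faked responses above, tgl1 is skipped because it is down and the five-node list comes from tgl3. Listing more than one server therefore gives the client a fallback if the first candidate is unavailable.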

License

BSD license.

Feedback

Send me email if you have problems, find bugs, or have any comments: Brent Chun. GEXEC is known to work unmodified on RedHat 7.2 and 7.3. As of the 0.3.6 release, it is also known to work unmodified on RedHat 9.0. With some work, it is also known to run on Mandrake, RedHat Enterprise Linux 3.1, and FreeBSD.

You might also be interested in GEXEC's web page on freshmeat.


bnc, PGP Public Key