
1. Introduction

An SSI cluster is a collection of computers that work together as if they are a single highly-available supercomputer. There are at least three reasons to create an SSI cluster of virtual UML machines.


1.1. Overview of SSI Clustering

The raison d'être of the SSI Clustering project is to provide a full, highly available SSI environment for Linux. Goals for this project include availability, scalability and manageability, using standard servers. Technology pieces include: membership, single root and single init, single process space and process migration, load leveling, single IPC, device and networking space, and single management space.

The SSI project was seeded with HP's NonStop Clusters for UnixWare (NSC) technology. It also leverages other open source technologies, such as Cluster Infrastructure (CI), Global File System (GFS), keepalive/spawndaemon, Linux Virtual Server (LVS), and the Mosix load-leveler, to create the best general-purpose clustering environment on Linux.


1.1.1. Cluster Infrastructure (CI)

The CI project is developing a common infrastructure for Linux clustering by extending the Cluster Membership Subsystem (CLMS) and Internode Communication Subsystem (ICS) from HP's NonStop Clusters for UnixWare (NSC) code base.


1.1.3. Keepalive/Spawndaemon

keepalive is a process monitoring and restart daemon that was ported from HP's NonStop Clusters for UnixWare (NSC). It offers significantly more flexibility than the respawn feature of init.

spawndaemon provides a command-line interface for keepalive. It's used to control which processes keepalive monitors, along with various other parameters related to monitoring and restart.

Keepalive/spawndaemon is currently incompatible with the GFS shared root. keepalive makes use of shared writable memory mapped files, which OpenGFS does not yet support. It's only mentioned for the sake of completeness.


1.1.4. Linux Virtual Server (LVS)

LVS allows you to build highly scalable and highly available network services over a set of cluster nodes. LVS offers various ways to load-balance connections (e.g., round-robin, least connection, etc.) across the cluster. The whole cluster is known to the outside world by a single IP address.
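
To give a sense of what this involves, here is a minimal sketch of how a virtual service is normally configured by hand on a conventional LVS director with the standard ipvsadm utility; the director# prompt, virtual IP, real-server addresses, and port are all illustrative.

director# ipvsadm -A -t 10.0.0.100:80 -s rr                  # virtual service, round-robin scheduling
director# ipvsadm -a -t 10.0.0.100:80 -r 192.168.50.1:80 -m  # add a real server (NAT forwarding)
director# ipvsadm -a -t 10.0.0.100:80 -r 192.168.50.2:80 -m  # add a second real server
   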

The SSI project will become more tightly integrated with LVS in the future. An advantage will be greatly reduced administrative overhead, because SSI kernels have the information necessary to automate most LVS configuration. Another advantage will be that the SSI environment allows much tighter coordination among server nodes.

LVS support is turned off in the current binary release of SSI/UML. To experiment with it you must build your own kernel as described in Section 4.


1.1.5. Mosix Load-Leveler

The Mosix load-leveler provides automatic load-balancing within a cluster. Using the Mosix algorithms, the load of each node is calculated and compared to the loads of the other nodes in the cluster. If it's determined that a node is overloaded, the load-leveler chooses a process to migrate to the best underloaded node.

Only the load-leveling algorithms have been taken from Mosix. The SSI Clustering project is using its own process migration model, membership mechanism and information sharing scheme.

The Mosix load-leveler is turned off in the current binary release of SSI/UML. To experiment with it you must build your own kernel as described in Section 4.


1.2. Overview of UML

User-Mode Linux (UML) allows you to run one or more virtual Linux machines on a host Linux system. It includes virtual block, network, and serial devices to provide an environment that is almost as full-featured as a hardware-based machine.
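
For example, once a UML kernel has been built, a single standalone instance is started by running it as an ordinary program on the host and pointing it at a root filesystem image. The file names below are illustrative; the SSI/UML scripts in Section 2 normally handle this for you.

host$ ./linux ubd0=root_fs.ext2 mem=64M
   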


1.3. Intended Audience

The following are various cluster types found in use today. If you use or intend to use one of these cluster types, you may want to consider SSI clustering as an alternative or addition.

  • High performance (HP) clusters, typified by Beowulf clusters, are constructed to run parallel programs (weather simulations, data mining, etc.).

  • Load-leveling clusters, typified by Mosix, are constructed to allow a user on one node to spread his workload transparently across all nodes in the cluster. This can be very useful for compute-intensive, long-running jobs that aren't massively parallel.

  • Web-service clusters, typified by the Linux Virtual Server (LVS) project and Piranha, do a different kind of load leveling. Incoming web service requests are load-leveled by a front end system across a set of standard servers.

  • Storage clusters, typified by Sistina's GFS and the OpenGFS project, consist of nodes which supply parallel, coherent, and highly available access to filesystem data.

  • Database clusters, typified by Oracle 9i RAC (formerly Oracle Parallel Server), consist of nodes which supply parallel, coherent, and HA access to a database.

  • High Availability clusters, typified by Lifekeeper, FailSafe and Heartbeat, are also often known as failover clusters. Resources, most importantly applications and nodes, are monitored. When a failure is detected, scripts are used to fail over IP addresses, disks, and filesystems, and to restart applications.

For more information about how SSI clustering compares to the cluster types above, read Bruce Walker's Introduction to Single System Image Clustering.


2. Getting Started

This section is a quick start guide for installing and running an SSI cluster of virtual UML machines. The most time-consuming part of this procedure is downloading the root image.


2.1. Root Image

First you need to download an SSI-ready root image. The compressed image weighs in at over 150MB, which will take more than six hours to download over a 56K modem, or about 45 minutes over a 500 Kbps broadband connection.

The image is based on Red Hat 7.2. This means the virtual SSI cluster will be running Red Hat, but it does not matter which distribution you run on the host system. A more advanced user can make a new root image based on another distribution. This is described in Section 5.

After downloading the root image, extract and install it.

host$ tar jxvf ~/ssiuml-root-rh72-0.6.5-1.tar.bz2
host$ su
host# cd ssiuml-root-rh72
host# make install
host# Ctrl-D
   

2.2. UML Utilities

Download the UML utilities. Extract, build, and install them.

host$ tar jxvf ~/uml_utilities_20020428.tar.bz2
host$ su
host# cd tools
host# make install
host# Ctrl-D
   

2.3. SSI/UML Utilities

Download the SSI/UML utilities. Extract, build, and install them.

host$ tar jxvf ~/ssiuml-utils-0.6.5-1.tar.bz2
host$ su
host# cd ssiuml-utils
host# make install
host# Ctrl-D
   

2.4. Booting the Cluster

Assuming the X Window System is running or the DISPLAY variable is set to an available X server, start a two-node cluster with

host$ ssi-start 2
   

This command boots nodes 1 and 2. It displays each console in a new xterm. The nodes run through their early kernel initialization, then seek each other out and form an SSI cluster before booting the rest of the way. If you're anxious to see what an SSI cluster can do, skip ahead to Section 3.

You'll probably notice that two other consoles are started. One is the lock server node, which is an artefact of how the GFS shared root is implemented at this time. The console is not a node in the cluster, and it won't give you a login prompt. For more information about the lock server, see Section 7.3. The other console is for the UML virtual networking switch daemon. It won't give you a prompt, either.

Note that only one SSI/UML cluster can be running at a time. It can, however, be run as a non-root user.

The argument to ssi-start is the number of nodes that should be in the cluster. It must be a number between 1 and 15; if it is omitted, it defaults to 3. The fifteen-node limit is arbitrary and can easily be raised in future releases.

To substitute your own SSI/UML files for the ones in /usr/local/lib and /usr/local/bin, provide your pathnames in ~/.ssiuml/ssiuml.conf. Values to override are KERNEL, ROOT, CIDEV, INITRD, and INITRD_MEMEXP. This feature is only needed by an advanced user.
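
The exact syntax of ssiuml.conf isn't documented here, but assuming simple name=value assignments, an override file might look like the following; every path shown is purely illustrative.

KERNEL=/home/user/ssi/linux
ROOT=/home/user/ssi/root_fs
CIDEV=/home/user/ssi/root_cidev
INITRD=/home/user/ssi/initrd-ssi.img
INITRD_MEMEXP=/home/user/ssi/initrd-memexp.img
   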


3. Playing Around

Bring up a three-node cluster with ssi-start. Log in to all three consoles as root. The initial password is root, but you'll be forced to change it the first time you log in.

The following demos should familiarize you with what an SSI cluster can do.


4. Building a Kernel and Ramdisk

Building your own kernel and ramdisk is necessary if you want to

  • experiment with LVS support (see Section 1.1.4),

  • experiment with the Mosix load-leveler (see Section 1.1.5), or

  • run the latest SSI code from CVS rather than the binary release.

Otherwise, feel free to skip this section.


4.1. Getting SSI Source

SSI source code is available as official release tarballs and through CVS. The CVS repository contains the latest, bleeding-edge code. It can be less stable than the official release, but it has features and bugfixes that the release does not have.


4.1.1. Official Release

The latest SSI release can be found at the top of this release list. At the time of this writing, the latest release is 0.6.5.

Download the latest release. Extract it.

host$ tar jxvf ~/ssi-linux-2.4.16-v0.6.5.tar.bz2
    

Determine the corresponding kernel version number from the release name. It appears before the SSI version number. For the 0.6.5 release, the corresponding kernel version is 2.4.16.


4.1.2. CVS Checkout

Follow these instructions to do a CVS checkout of the latest SSI code. The modulename is ssic-linux.

You also need to check out the latest CI code. Follow these instructions to do that. The modulename is ci-linux.

To do a developer checkout, you must be a CI or SSI developer. If you are interested in becoming a developer, read Section 8.3 and Section 8.4.
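
For reference, an anonymous (read-only) checkout of both modules usually looks something like the following. The pserver host and repository paths are assumed to follow the standard SourceForge pattern; check the project pages for the exact strings.

host$ cvs -d:pserver:anonymous@cvs.ssic-linux.sourceforge.net:/cvsroot/ssic-linux login
host$ cvs -z3 -d:pserver:anonymous@cvs.ssic-linux.sourceforge.net:/cvsroot/ssic-linux co ssic-linux
host$ cvs -d:pserver:anonymous@cvs.ci-linux.sourceforge.net:/cvsroot/ci-linux login
host$ cvs -z3 -d:pserver:anonymous@cvs.ci-linux.sourceforge.net:/cvsroot/ci-linux co ci-linux
   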

Determine the corresponding kernel version with

host$ head -4 ssic-linux/ssi-kernel/Makefile
VERSION = 2
PATCHLEVEL = 4
SUBLEVEL = 16
EXTRAVERSION =
    

In this case, the corresponding kernel version is 2.4.16. If you're paranoid, you might want to make sure the corresponding kernel version for CI is the same.

host$ head -4 ci-linux/ci-kernel/Makefile
VERSION = 2
PATCHLEVEL = 4
SUBLEVEL = 16
EXTRAVERSION =
    

They will only differ when I'm merging them up to a new kernel version. There is a window between checking in the new CI code and the new SSI code. I'll do my best to minimize that window. If you happen to see it, wait a few hours, then update your sandboxes.

host$ cd ssic-linux
host$ cvs up -d
host$ cd ../ci-linux
host$ cvs up -d
host$ cd ..
    

4.2. Getting the Base Kernel

Download the appropriate kernel source. Get the version you determined in Section 4.1. Kernel source can be found on this U.S. server or any one of these mirrors around the world.

Extract the source. This will take a little time.

host$ tar jxvf ~/linux-2.4.16.tar.bz2
   

or

host$ tar zxvf ~/linux-2.4.16.tar.gz
   

4.5. Adding GFS Support to the Host

To install the kernel you must be able to loopback mount the GFS root image. You need to do a few things to the host system to make that possible.

Download any version of OpenGFS after 0.0.92, or check out the latest source from CVS.

Apply the appropriate kernel patches from the kernel_patches directory to your kernel source tree. Make sure you enable the /dev filesystem, but do not have it automatically mount at boot. (When you configure the kernel select 'File systems -> /dev filesystem support' and unselect 'File systems -> /dev filesystem support -> Automatically mount at boot'.) Build the kernel as usual, install it, rewrite your boot block and reboot.
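
For a 2.4-series host kernel, the whole sequence looks roughly like this; the patch name is a placeholder (it depends on your kernel and OpenGFS versions), and the last steps assume a make install/boot-loader setup you may need to adjust.

host$ cd host_kernel_source_tree
host$ patch -p1 < ../opengfs/kernel_patches/your-kernel-patch   # placeholder name
host$ make menuconfig        # enable /dev filesystem support, disable "Automatically mount at boot"
host$ make dep bzImage modules
host$ su
host# make modules_install
host# make install           # or copy bzImage and update your boot loader by hand
host# reboot
   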

Configure, build and install the GFS modules and utilities.

host$ cd opengfs
host$ ./autogen.sh --with-linux_srcdir=host_kernel_source_tree
host$ make
host$ su
host# make install
   

Configure two aliases for one of the host's network devices. The first alias should be 192.168.50.1, and the other should be 192.168.50.101. Both should have a netmask of 255.255.255.0.

host# ifconfig eth0:0 192.168.50.1 netmask 255.255.255.0
host# ifconfig eth0:1 192.168.50.101 netmask 255.255.255.0
   

Look at the contents of /proc/partitions (with cat, for example). Select two device names that you're not using for anything else, and create two loopback devices with those names. For example:

host# mknod /dev/ide/host0/bus0/target0/lun0/part1 b 7 1
host# mknod /dev/ide/host0/bus0/target0/lun0/part2 b 7 2
   

Finally, load the necessary GFS modules and start the lock server daemon.

host# modprobe gfs
host# modprobe memexp
host# memexpd
host# Ctrl-D
   

Your host system now has GFS support.


4.6. Installing the Kernel

Loopback mount the shared root.

host$ su
host# losetup /dev/loop1 root_cidev
host# losetup /dev/loop2 root_fs
host# passemble
host# mount -t gfs -o hostdata=192.168.50.1 /dev/pool/pool0 /mnt
   

Install the modules into the root image.

host# make modules_install ARCH=um INSTALL_MOD_PATH=/mnt
host# Ctrl-D
   

4.7. Building GFS for UML

You have to repeat some of the steps you did in Section 4.5. Extract another copy of the OpenGFS source and call it opengfs-uml. Then add the line marked with a + below to make/modules.mk.in; the surrounding lines are shown for context.

 KSRC		:= /root/linux-ssi
 
 INCL_FLAGS	:= -I. -I.. -I$(GFS_ROOT)/src/include -I$(KSRC)/include \
+		    -I$(KSRC)/arch/um/include \
 		    $(EXTRA_INCL)
 DEF_FLAGS	:= -D__KERNEL__ -DMODULE  $(EXTRA_FLAGS)
 OPT_FLAGS	:= -O2 -fomit-frame-pointer 
   

Configure, build and install the GFS modules and utilities for UML.

host$ cd opengfs-uml
host$ ./autogen.sh --with-linux_srcdir=UML_kernel_source_tree
host$ make
host$ su
host# make install DESTDIR=/mnt
   

4.8. Building the Ramdisk

Change root into the loopback mounted root image, and use the --uml argument to cluster_mkinitrd to build a ramdisk.

host# /usr/sbin/chroot /mnt
host# cluster_mkinitrd --uml initrd-ssi.img 2.4.16-21um
   

Exit the chroot. Then move the new ramdisk out of the root image, and assign ownership to the appropriate user. Wrap things up.

host# Ctrl-D
host# mv /mnt/initrd-ssi.img ~username
host# chown username ~username/initrd-ssi.img
host# umount /mnt
host# passemble -r all
host# losetup -d /dev/loop1
host# losetup -d /dev/loop2
host# Ctrl-D
host$ cd ..
   

5. Building a Root Image

Building your own root image is necessary if you want to use a distribution other than Red Hat 7.2. Otherwise, feel free to skip this section.

These instructions describe how to build a Red Hat 7.2 image. At the end of this section is a brief discussion of how other distributions might differ. Building a root image for another distribution is left as an exercise for the reader.


5.1. Base Root Image

Download the Red Hat 7.2 root image from the User-Mode Linux (UML) project. As with the root image you downloaded in Section 2.1, it is over 150MB.

Extract the image.

host$ bunzip2 -c root_fs.rh72.pristine.bz2 >root_fs.ext2
   

Loopback mount the image.

host$ su
host# mkdir /mnt.ext2
host# mount root_fs.ext2 /mnt.ext2 -o loop,ro
   

5.2. GFS Root Image

Make a blank GFS root image. You also need to create an accompanying lock table image. Be sure you've added support for GFS to your host system by following the instructions in Section 4.5.

host# dd of=root_cidev bs=1024 seek=4096 count=0
host# dd of=root_fs bs=1024 seek=2097152 count=0
host# chmod a+w root_cidev root_fs
host# losetup /dev/loop1 root_cidev
host# losetup /dev/loop2 root_fs
   

Enter the following pool information into a file named pool0cidev.cf.

poolname pool0cidev
subpools 1
subpool 0 0 1 gfs_data
pooldevice 0 0 /dev/loop1 0
   

Enter the following pool information into a file named pool0.cf.

poolname pool0
subpools 1
subpool 0 0 1 gfs_data
pooldevice 0 0 /dev/loop2 0
   

Write the pool information to the loopback devices.

host# ptool pool0cidev.cf
host# ptool pool0.cf
   

Create the pool devices.

host# passemble
   

Enter the following lock table into a file named gfscf.cf.

datadev:	/dev/pool/pool0
cidev:		/dev/pool/pool0cidev
lockdev:	192.168.50.101:15697
cbport:		3001
timeout:	30
STOMITH: NUN
name:none
node: 192.168.50.1	1	SM: none
node: 192.168.50.2	2	SM: none
node: 192.168.50.3	3	SM: none
node: 192.168.50.4	4	SM: none
node: 192.168.50.5	5	SM: none
node: 192.168.50.6	6	SM: none
node: 192.168.50.7	7	SM: none
node: 192.168.50.8	8	SM: none
node: 192.168.50.9	9	SM: none
node: 192.168.50.10	10	SM: none
node: 192.168.50.11	11	SM: none
node: 192.168.50.12	12	SM: none
node: 192.168.50.13	13	SM: none
node: 192.168.50.14	14	SM: none
node: 192.168.50.15	15	SM: none
   

Write the lock table to the cidev pool device.

host# gfsconf -c gfscf.cf
   

Format the root disk image.

host# mkfs_gfs -p memexp -t /dev/pool/pool0cidev -j 15 -J 32 -i /dev/pool/pool0
   

Mount the root image.

host# mount -t gfs -o hostdata=192.168.50.1 /dev/pool/pool0 /mnt
   

Copy the ext2 root to the GFS image.

host# cp -a /mnt.ext2/. /mnt
   

Clean up.

host# umount /mnt.ext2
host# rmdir /mnt.ext2
host# Ctrl-D
host$ rm root_fs.ext2
   

5.3. Getting Cluster Tools Source

Cluster Tools source code is available as official release tarballs and through CVS. The CVS repository contains the latest, bleeding-edge code. It can be less stable than the official release, but it has features and bugfixes that the release does not have.


5.3.1. Official Release

The latest release can be found at the top of the Cluster-Tools section of this release list. At the time of this writing, the latest release is 0.6.5.

Download the latest release. Extract it.

host$ tar jxvf ~/cluster-tools-0.6.5.tar.bz2
    

5.3.2. CVS Checkout

Follow these instructions to do a CVS checkout of the latest Cluster Tools code. The modulename is cluster-tools.

To do a developer checkout, you must be a CI developer. If you are interested in becoming a developer, read Section 8.3 and Section 8.4.


5.4. Building and Installing Cluster Tools

host$ su
host# cd cluster-tools
host# make install_ssi_redhat UML_ROOT=/mnt
   

5.5. Installing Kernel Modules

If you built a kernel, as described in Section 4, then follow the instructions in Section 4.4 and Section 4.7 to install kernel and GFS modules onto your new root.

Otherwise, mount the old root image and copy the modules directory out of /mnt/lib/modules. Then unmount it, mount the new root image in its place, and copy the modules into it.
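
A minimal sketch of that copy, assuming the old SSI root image is mounted on /mnt exactly as in Section 4.6 and using /tmp as a staging area:

host# cp -a /mnt/lib/modules /tmp/ssi-modules
   

Unmount the old root as described in Section 5.7, mount the new root on /mnt, and then:

host# cp -a /tmp/ssi-modules/. /mnt/lib/modules
host# rm -rf /tmp/ssi-modules
   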


5.7. Unmounting the Root Image

host# umount /mnt
host# passemble -r all
host# losetup -d /dev/loop1
host# losetup -d /dev/loop2
   

6. Moving to a Hardware-Based Cluster

If you plan to use SSI clustering in a production system, you probably want to move to a hardware-based cluster. That way you can take advantage of the high-availability and scalability that a hardware-based SSI cluster can offer.

Hardware-based SSI clusters have significantly higher availability. If a UML host kernel panics, or the host machine has a hardware failure, its UML-based SSI cluster goes down. On the other hand, if one of the SSI kernels panics, or one of the hardware-based nodes has a failure, the cluster continues to run. Centralized kernel services can fail over to a new node, and critical user-mode programs can be restarted by the application monitoring and restart daemon.

Hardware-based SSI clusters also have significantly higher scalability. Each node has one or more CPUs that truly work in parallel, whereas a UML-based cluster merely simulates having multiple nodes by time-sharing on the host machine's CPUs. Adding nodes to a hardware-based cluster increases the volume of work it can handle, but adding nodes to a UML-based cluster bogs it down with more processes to run on the same number of CPUs.


6.1. Requirements

You can build hardware-based SSI clusters with x86 or Alpha machines. More architectures, such as IA64, may be added in the future. Note that an SSI cluster must be homogeneous. You cannot mix architectures in the same cluster.

The cluster interconnect must support TCP/IP networking. 100 Mbps ethernet is acceptable. For security reasons, it should be a private network. Each node should have a second network interface for external traffic.

Right now, the most expensive requirement of an SSI cluster is the shared drive, required for the shared GFS root. This will no longer be a requirement when CFS, which is described below, is available. The typical configuration for the shared drive is a hardware RAID disk cabinet attached to all nodes with a Fibre Channel SAN. For a two-node cluster, it is also possible to use shared SCSI, but it is not directly supported by the current cluster management tools.

The GFS shared root also requires one Linux machine outside of the cluster to be the lock server. It need not be the same architecture as the nodes in the cluster. It just has to run memexpd, a user-mode daemon. Eventually, GFS will work with a Distributed Lock Manager (DLM). This would eliminate the need for the external lock server, which is a single point of failure. It could also free up the machine to be another node in your cluster.

In the near future, the Cluster File System (CFS) will be an option for the shared root. It is a stateful NFS that uses a token mechanism to provide tight coherency guarantees. With CFS, the shared root can be stored on the internal disk of one of the nodes. The on-disk format can be any journalling file system, such as ext3 or ReiserFS.

The initial version of CFS will not provide high availability. Future versions of CFS will allow the root to be mirrored across the internal disks of two nodes. A technology such as the Distributed Replicated Block Device (DRBD) would be used for this. This is a low-cost solution for the shared root, although it has a performance penalty.

Future versions will also allow the root to be stored on a disk shared by two or more nodes, but not necessarily shared by all nodes. If the CFS server node crashes, its responsibilities would fail over to another node attached to the shared disk.


6.2. Resources

Start with the installation instructions for SSI.

If you'd like to install SSI from CVS code, follow these instructions to checkout modulename ssic-linux, and these instructions to checkout modulenames ci-linux and cluster-tools. Read the INSTALL and INSTALL.cvs files in both the ci-linux and ssic-linux sandboxes. Also look at the README file in the cluster-tools sandbox.

For more information, read Section 7.


7. Further Information

Here are some links to information on SSI clusters, CI clusters, GFS, UML, and other clustering projects.


7.1. SSI Clusters

Start with the SSI project homepage. In particular, the documentation may be of interest. The SourceForge project summary page also has some useful information.

If you have a question or concern, post it to the mailing list. If you'd like to subscribe, you can do so through this web form.

If you are working from a CVS sandbox, you may also want to sign up for the ssic-linux-checkins mailing list to receive checkin notices. You can do that through this web form.


7.2. CI Clusters

Start with the CI project homepage. In particular, the documentation may be of interest. The SourceForge project summary page also has some useful information.

If you have a question or concern, post it to the mailing list. If you'd like to subscribe, you can do so through this web form.

If you are working from a CVS sandbox, you may also want to sign up for the ci-linux-checkins mailing list to receive checkin notices. You can do that through this web form.


7.3. GFS

SSI clustering currently depends on the Global File System (GFS) to provide a single root. The open-source version of GFS is maintained by the OpenGFS project. They also have a SourceForge project summary page.

Right now, GFS requires either a DMEP-equipped shared drive or a lock server outside the cluster. The lock server is the only software solution for coordinating disk access, and it is not truly HA. There are plans to make OpenGFS support IBM's Distributed Lock Manager (DLM), which would distribute the lock server's responsibilities across all the nodes in the cluster. If any node fails, the locks it managed would fail over to other nodes. This would be a true HA software solution for coordinating disk access.

If you have a question or concern, post it to the mailing list. If you'd like to subscribe, you can do so through this web form.


7.4. UML

The User-Mode Linux (UML) project has a homepage and a SourceForge project summary page.

If you have a question or concern, post it to the mailing list. If you'd like to subscribe, you can do so through this web form.


8. Contributing

If you'd like to contribute to the SSI project, you can do so by testing it, writing documentation, fixing bugs, or working on new features.


8.1. Testing

While using the SSI clustering software, you may run into bugs or features that don't work as well as they should. If so, browse the SSI and CI bug databases to see if someone has seen the same problem. If not, either file a bug yourself or post a message to the mailing list to discuss the issue further.

It is important to be as specific as you can in your bug report or posting. Simply saying that the SSI kernel doesn't boot or that it panics is not enough information to diagnose your problem.


8.2. Documentation

There is already some documentation for SSI and CI, but more would certainly be welcome. If you'd like to write instructions for users or internals documentation for developers, post a message to the mailing list to express your interest.


8.3. Debugging

Debugging is a great way to get your feet wet as a developer. Browse the SSI and CI bug databases to see what problems need to be fixed. If a bug looks interesting, but is assigned to a developer, contact them to see if they are actually working on it.

After fixing the problem, send your patch to the SSI or CI mailing list. If it looks good, a developer will check it into the repository. After submitting a few patches, you'll probably be invited to become a developer yourself. Then you'll be able to check in your own work.


8.4. Adding New Features

After fixing a bug or two, you may be inclined to work on enhancing or adding an SSI feature. You can look over the SSI and CI project lists for ideas, or you can suggest something of your own. Before you start working on a feature, discuss it first on the SSI or CI mailing list.