FOSSology: Multi System Setup¶
FOSSology: How To Install Multiple Hosts¶
Notes: Configuration files, primarily fossology.conf (in /usr/local/etc/fossology/fossology.conf) and agent specific config files on each of the agent systems (in /usr/local/etc/fossology/mods-enabled/<agent>/), must be modified after running make install and prior to running fo-postinstall.
FOSSology: How To Configure Multiple Hosts¶
The scheduler and repository are designed so they can be distributed across multiple hosts. There are a few reasons for doing this:
- I/O Bottlenecks. Some of the agents (e.g., the unpack agent) are I/O intensive. As a result, running two in parallel on the same host will likely be much slower than running them serially (one at a time) on that host. If you want to run two unpack agents in parallel, you should run them on different systems (or on a single large system that has multiple I/O channels to relieve the bottleneck).
- CPU Bottlenecks. The slowest agents are the analyzers, such as the license analyzer. While these can easily be used in parallel, each consumes a full CPU. Running three analyzers in parallel on a system with one CPU will not speed anything up. However, if you have multiple CPUs on many different computers, then you can use them.
The ideal configuration distributes the repository across hosts and runs agents on those hosts. This way, the data used by the agents is local rather than transferred over the network.
Part 1: The Repository¶
The repository (repo) is just a collection of directories on the file system. The directory's location is defined in the fossology.conf file. (I'll refer to the path as $Repo in this document, but the default location is /srv/fossology/repository/.)
The layout of the repository is as follows:
$Repo/host/files/##/##/##/
$Repo/host/gold/##/##/##/
where "host" is just a string (not required to be a hostname) and "##" is a hexadecimal number. For example:
$Repo/localhost/files/01/e4/2f/01e42f923c85.txt
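To make the layout concrete, here is a minimal sketch that builds such a path from a file name. It assumes, based only on the example above, that the three "##" levels are the first three hex pairs of the name. The function name repo_path_for is hypothetical and is not part of FOSSology.

```shell
# Hypothetical helper (not part of FOSSology): build the repository path
# for a file, assuming the three "##" levels are the first three hex
# pairs of the file name, as in the example above.
repo_path_for() {
    repo=$1; host=$2; type=$3; name=$4
    d1=$(printf '%s' "$name" | cut -c1-2)
    d2=$(printf '%s' "$name" | cut -c3-4)
    d3=$(printf '%s' "$name" | cut -c5-6)
    printf '%s/%s/%s/%s/%s/%s/%s\n' "$repo" "$host" "$type" "$d1" "$d2" "$d3" "$name"
}

repo_path_for /srv/fossology/repository localhost files 01e42f923c85.txt
# prints: /srv/fossology/repository/localhost/files/01/e4/2f/01e42f923c85.txt
```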
The [REPOSITORY] section in fossology.conf identifies the name of the host and the directories under it. For example:
[REPOSITORY]
sirius[] = * 00 7f
buckbeak[] = * 80 ff
This will create four directories:
- $Repo/sirius/files/ with subdirectories in the range 00 to 7f.
- $Repo/sirius/gold/ with subdirectories in the range 00 to 7f.
- $Repo/buckbeak/files/ with subdirectories in the range 80 to ff.
- $Repo/buckbeak/gold/ with subdirectories in the range 80 to ff.
Now you can use $Repo/sirius/ and $Repo/buckbeak/ as mount-points for remote file systems. The separation of 00-7f and 80-ff should generally split the repository in half. (The split may not be equal in size, but it should be close.) The subdirectories are named after the SHA1 checksum of the files, and checksum values are effectively randomly distributed, so the split should be fairly even.
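As an illustration of the split, the following sketch reports which of the two example hosts would hold a file, judging by the first two hex digits of its checksum. It hard-codes the example ranges (sirius = 00-7f, buckbeak = 80-ff). The function name repo_host_for is hypothetical, and this is not how FOSSology itself performs the lookup.

```shell
# Hypothetical helper: map a checksum to one of the two example hosts.
# Ranges are taken from the [REPOSITORY] example: 00-7f and 80-ff.
repo_host_for() {
    prefix=$(printf '%s' "$1" | cut -c1-2)
    value=$((0x$prefix))          # interpret the two digits as hexadecimal
    if [ "$value" -le 127 ]; then # 0x7f == 127
        echo sirius
    else
        echo buckbeak
    fi
}

repo_host_for 01e42f923c85   # prints: sirius
repo_host_for 9c0ffee12345   # prints: buckbeak
```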
The repository must be writable by the group "fossy". To ensure that all files are group accessible, the directories should be set with the permissions "g+rwxs". By setting the SGID bit (g+s) on a directory, all files and directories created under it will inherit the directory's group. The big catch here is that all mounted filesystems must use the same group ID for "fossy". Ideally, the top directories should be owned by user "fossy", and that user should have the same user ID on all systems.
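One possible way to apply these permissions is sketched below. The function name set_repo_perms is hypothetical; the group defaults to "fossy" but can be overridden, which is an addition for illustration only. It would normally be run as root against $Repo.

```shell
# Sketch (hypothetical helper): make a repository tree group-writable
# and set the SGID bit on every directory.
set_repo_perms() {
    dir=$1
    group=${2:-fossy}
    chgrp -R "$group" "$dir"
    # Directories: group rwx plus SGID, so new entries inherit the group.
    find "$dir" -type d -exec chmod g+rwxs {} +
    # Existing files: group read and write.
    find "$dir" -type f -exec chmod g+rw {} +
}

# Example (as root): set_repo_perms /srv/fossology/repository
```

To compare the numeric IDs across systems, running `id fossy` on each host and checking that the uid and gid values match is usually enough.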
Part 2: The Scheduler¶
Now that you have split the repository across mount points, we will configure fossology to distribute analysis across the same hosts.
The fossology.conf file (default source install location is /usr/local/etc/fossology/fossology.conf) contains a [HOSTS] section to specify the set of hosts available to analyze files. The default entry looks like this:
[HOSTS]
localhost = localhost AGENT_DIR 10
AGENT_DIR is a fossology environment variable representing the path to the mods-enabled directory. The number 10 represents the maximum number of modules that can be run simultaneously on that host. The entry above is only valid for a single fossology system, i.e., one where the repository, database and web server all reside on the same machine. For a multi-system configuration, you can utilize multiple systems to distribute the load. Here are two example scenarios.
Scenario 1: Distributed Repository, Local Agents¶
If you lack disk space for the Repo on the local system, you can distribute the repository and still use the local CPUs for running agents. This configuration is not ideal since all communication to the repository will be done over the network (significant speed impact). However, if you need the disk space then this is an option. The repository files will be used regardless of where they are remotely hosted.
For this configuration, you can use the default [HOSTS] entry in the fossology.conf file and the scheduler will run all agents locally.
Scenario 2: Distributed Repository, Distributed Agents¶
This is the best, usual and expected scenario since modules can run on the same systems where the repository data resides. You will need to add an entry in the [HOSTS] section of fossology.conf for each additional agent host. For example:
[HOSTS]
localhost = localhost AGENT_DIR 10
sirius = sirius.hp.com /etc/fossology 10
buckbeak = buckbeak.hp.com /etc/fossology 10
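When checking a configuration by hand, it can help to list the entries back out. The following is a minimal sketch that prints the name, address, agent directory and maximum agent count for each line in the [HOSTS] section. The function name list_hosts is hypothetical, and treating lines that start with ";" or "#" as comments is an assumption about the file format.

```shell
# Sketch (hypothetical helper): print "name address agent_dir max" for
# each entry in the [HOSTS] section of a fossology.conf-style file.
list_hosts() {
    awk '
        /^[ \t]*[;#]/ { next }   # assumed comment syntax
        /^\[/         { in_hosts = ($0 ~ /^\[HOSTS\]/); next }
        in_hosts && /=/ {
            split($0, kv, "=")
            name = kv[1]
            gsub(/[ \t]/, "", name)
            n = split(kv[2], field, " ")
            if (n == 3) print name, field[1], field[2], field[3]
        }
    ' "$1"
}

# Example: list_hosts /usr/local/etc/fossology/fossology.conf
```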
The scheduler uses the ssh command to start jobs on the hosts. The path to the mods-enabled directory (AGENT_DIR on localhost and "/etc/fossology" on sirius and buckbeak) is used to build the ssh command. The command is executed as user fossy. Therefore, fossy needs passphrase-less ssh access to all hosts. There are many examples of how to do this on the web. Here is one: http://www.debian-administration.org/articles/152
- Create SSH keys for the fossy user and distribute them on all hosts. The keys should NOT include a passphrase. (Since the scheduler cannot enter a passphrase, a required passphrase will cause the remote execution to fail.)
- Test the keys (including accepting the server key for the first connection). You should be able to login without a password.
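The two steps above can be sketched as follows, run as user fossy. The function names create_key and check_hosts are hypothetical, the ed25519 key type is just one reasonable choice, and the SSH variable exists only so the ssh command can be swapped out for testing. BatchMode=yes makes ssh fail instead of prompting, which mimics the scheduler's inability to type a password.

```shell
# Sketch (hypothetical helpers), to be run as user fossy.

# Create a key pair with an empty passphrase (-N '').
create_key() {
    keyfile=${1:-$HOME/.ssh/id_ed25519}
    ssh-keygen -q -t ed25519 -N '' -f "$keyfile"
}

# Verify that each host accepts a non-interactive, key-based login.
check_hosts() {
    failed=0
    for host in "$@"; do
        if ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done
    return "$failed"
}

# Example:
#   create_key
#   ssh-copy-id -i "$HOME/.ssh/id_ed25519.pub" fossy@sirius.hp.com
#   ssh fossy@sirius.hp.com true     # first login: accept the server key
#   check_hosts sirius.hp.com buckbeak.hp.com
```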
Testing the Configuration¶
When you have finished configuring fossology.conf, you can test it with the scheduler command. As the user "fossy", run this command:
/usr/local/share/fossology/scheduler/agent/fo_scheduler -t
This will attempt to spawn every agent. If there are any errors, it will tell you which command failed. Some common failure causes:
- Wrong user or group. The scheduler and all agents run as user "fossy" and group "fossy". If this is not the case, then programs will fail to run. (If you run the scheduler as root, it will change itself to run as user fossy.)
- Bad path. Every agent entry must have the pathname to the mods-enabled directory set correctly.
- Typographical errors. If a hostname is misspelled, things will fail to run.
- Password required. If the remote login system requires a password, then it will not work with the scheduler. As user "fossy", try the remote login command and see if it works without a password.
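Two of these causes (wrong user or group, bad path) can be checked locally before running the scheduler test. The following is a rough sketch; the function names check_identity and check_agent_dir are hypothetical, and the defaults of "fossy" come from the description above.

```shell
# Sketch (hypothetical helpers): local pre-flight checks.

# Succeeds only if the current user and group membership match.
check_identity() {
    want_user=${1:-fossy}
    want_group=${2:-fossy}
    if [ "$(id -un)" != "$want_user" ]; then
        echo "running as $(id -un), expected $want_user"
        return 1
    fi
    if ! id -Gn | tr ' ' '\n' | grep -qxF "$want_group"; then
        echo "user is not in group $want_group"
        return 1
    fi
    return 0
}

# Succeeds only if the given mods-enabled path exists as a directory.
check_agent_dir() {
    if [ ! -d "$1" ]; then
        echo "missing agent directory: $1"
        return 1
    fi
    return 0
}

# Example: check_identity && check_agent_dir /usr/local/etc/fossology/mods-enabled
```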