Package Install Automation
Version 5 (Mark Donohoe, 04/14/2012 01:06 am)
h1. Package Install Automation
This page documents the current state of the package automation. See "Package Install Testing":[[Package Install Testing]] for details on setting up the virtual machines used to automate the testing.
The automation makes heavy use of VM features. A snapshot is taken of each VM. After a test run is complete (packages installed and tests run), the VM is __reverted__ to its previous __fossology not installed__ state. This allows rapid turnaround of testing. If this method were not used, a template would have to be inflated into a new VM, and the VM would have to be configured for DHCP and then brought online.
h2. Process Overview
Each step below is a separate job in Jenkins.
* Packages are built first. The configuration files needed by the script are checked in by the package build process.
* The packages are installed using a PHP script running as root on the slaves.
* The tests are launched either by hand or as part of Jenkins job chaining.
Jenkins does a lot of things well. Like most software, it's not perfect. One of the most annoying things about Jenkins (at least using Firefox on Linux) is that it often presents stale pages. When using Jenkins, use the refresh button on your browser a lot. Sometimes turning on 'auto refresh' helps, but not always. For example, if you are trying to check on the state of slave nodes, you should hit both the browser refresh and the refresh status button in Jenkins.
You have been warned. :-)
h1. Automation issues
The issues described here are caused by interactions between the VMs and Jenkins.
* Reverting a VM to its snapshot has a number of bad side effects:
** Some VMs lose their IP address. It does not happen on every revert, and only on a few machines, but often enough to cause grief.
** The clock on a reverted VM is often wrong, sometimes days behind the master server. This causes test failures because the wrong set of sources gets checked out (the checkout takes top of tree (TOT) for the VM's day and time). After reverting snapshots, human intervention is therefore needed, if only to check the state of the VMs against how Jenkins sees them.
** The worst effect is that Jenkins often loses contact with the reverted VMs. It tries to recover and reconnect, but sometimes a number of nodes stay offline and have to be brought back online manually in Jenkins.
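The clock drift described above can be checked with a small script before a run. The following is a minimal sketch, not part of the existing automation: the helper names @skew_seconds@ and @clock_ok@ and the default tolerance of 300 seconds are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch only: compare a node's clock with the master's clock.
# skew_seconds and clock_ok are hypothetical helper names.

# Print the absolute difference, in seconds, between two epoch timestamps.
skew_seconds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# Succeed if the two timestamps are within $3 seconds of each other.
# The default of 300 seconds is an assumed tolerance, not a project setting.
clock_ok() {
    [ "$(skew_seconds "$1" "$2")" -le "${3:-300}" ]
}
```

One possible use, assuming root ssh access to a node (the host name here is hypothetical): <code>clock_ok "$(date +%s)" "$(ssh root@somenode date +%s)" || echo "clock skew on somenode"</code>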
h2. Known failures
Package-Install-Testing is a Matrix Job. When a matrix job starts, it picks a node at random from the list of nodes and starts a __parent job__ on it. That job is called **Install-Test**. It will almost always fail, and the failure should be ignored. The console output indicates which machine was picked. When the Install-Test job passes, the slave job on the same machine will often fail due to apt/dpkg lock conflicts. For example, if Install-Test picks slave squeze32 for the parent job and passes, the job for the slave squeze32 might pass or it might fail. In this situation, as long as one of them passes, the other failure can be ignored.
The issue is that Jenkins counts the Install-Test job as part of the run. When it fails, the complete job is marked failed, even though the packages all install cleanly and all the other jobs succeed.
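The rule for reading a run can be written down as a short script. This is a sketch under stated assumptions: the input format (one @job result@ pair per line) and the function name @matrix_verdict@ are hypothetical, and the result strings SUCCESS and FAILURE are taken to be what Jenkins reports for each job. It treats Install-Test and the slave on the same machine as a pair, where one passing result is enough, and requires every other slave to pass.

```shell
#!/bin/sh
# Sketch only: decide whether a matrix run really failed.
# $1    = node that ran the Install-Test parent job (from its console output)
# stdin = lines of "job result", e.g. "squeze32 SUCCESS" (hypothetical format)
matrix_verdict() {
    parent_node=$1
    parent_result=FAILURE
    same_node_result=FAILURE
    verdict=SUCCESS
    while read -r job result; do
        case "$job" in
            Install-Test)   parent_result=$result ;;
            "$parent_node") same_node_result=$result ;;
            # Any other slave must pass for the run to count as good.
            *) [ "$result" = SUCCESS ] || verdict=FAILURE ;;
        esac
    done
    # The parent job and its same-named slave fail the run only if both failed.
    if [ "$parent_result" != SUCCESS ] && [ "$same_node_result" != SUCCESS ]; then
        verdict=FAILURE
    fi
    echo "$verdict"
}
```

For example, feeding it <code>Install-Test FAILURE</code> and <code>squeze32 SUCCESS</code> with squeze32 as the parent node prints SUCCESS, which matches the advice to ignore the parent job's failure.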
h3. Startup Failures
Before starting Package-Install-Test, check that all of the nodes (slaves) are online and that their clocks are synchronized to within minutes of each other. Because of the issues described above, checking the nodes first minimizes startup failures.
To check nodes, go to "Jenkins Dashboard":http://fonightly.usa.hp.com:8080 and select the +Manage Jenkins+ link, then on that page select the +Manage Nodes+ link.
* Time is out of sync: log in to the machine and, as root, run <code>/etc/init.d/ntp restart</code> or, on RHEL-based systems, <code>service ntpd restart</code>. If that doesn't work, set the date by hand (as root): <code>date mmddhhmmyy</code>
* Machine has lost its IP address: use the VMware client software to get to the console and restart networking or network-manager (Debian/Ubuntu). If that doesn't work, reboot the machine. Restarting networking is faster than rebooting.
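Because the time-service restart command differs between distribution families, a small helper can pick the right one. The sketch below is illustrative only: the function names are hypothetical, it assumes the Debian/Ubuntu init script is named @ntp@ and the RHEL service is named @ntpd@, and it guesses the family from the presence of @/etc/debian_version@ or @/etc/redhat-release@. It only prints the command and does not run it.

```shell
#!/bin/sh
# Sketch only: choose the time-service restart command for a slave.
# ntp_restart_cmd and distro_family are hypothetical helper names.

# Print the restart command for a distro family ("debian" or "rhel").
ntp_restart_cmd() {
    case "$1" in
        debian) echo "/etc/init.d/ntp restart" ;;
        rhel)   echo "service ntpd restart" ;;
        *)      echo "unknown distro family: $1" >&2; return 1 ;;
    esac
}

# Guess the family of the local machine from its release marker files.
distro_family() {
    if [ -f /etc/debian_version ]; then
        echo debian
    elif [ -f /etc/redhat-release ]; then
        echo rhel
    else
        echo unknown
    fi
}
```

On a slave, running <code>ntp_restart_cmd "$(distro_family)"</code> prints the command to run as root. Printing it first, rather than executing it directly, leaves room to check it by eye.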