A Simple "word count" Agent
In this tutorial, we will write an agent to count words in a file and store the result in the database.
To start off, let's introduce four components common to all agents:

- **The agent.** This performs the analysis and stores results in the database. Agents can be built in any language, from shell script to C.
- **The scheduler.** Agents are executed through a scheduler.
- **The interface.** The user interface (UI) or command-line interface (CLI) schedules a job (an individual execution of an agent) via a //jobqueue// and displays any results.
- **The jobqueue.** Every job must have an associated //jobqueue record// containing agent-specific arguments. The arguments can be either SQL that selects the agent's work or data entered by the user through the interface. The jobqueue record is passed to the agent by the scheduler.

The jobqueue operates in two modes: generic and per-host. The basic idea is that the file repository may be split across multiple hosts. Rather than transferring files across the network (e.g., over NFS), it may be faster to run an agent on the same host as the file. For example, the wget_agent downloads a file from the Internet and stuffs it into the repository; since the repository host is unknown ahead of time, wget_agent can run on any host. This is an example of a generic agent. In contrast, the license analysis agents process files already in the repository. Since the hosts are known, it is faster to run these agents on the host that stores each file.

There is one other distinction. A generic-host entry in the jobqueue contains a single request: the value of jobqueue.jq_args is passed as-is to the agent, and the agent is assumed to know how to parse the line. In contrast, host-specific agents have SQL in jobqueue.jq_args. The scheduler runs the SQL and sends the results of this multi-SQL query (MSQ) to the agent.
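To make the two modes concrete, here is a sketch of what jobqueue.jq_args might contain in each case. Both entries are illustrative assumptions: the generic line is not the actual wget_agent argument syntax, and the SQL is abbreviated.

```
# Generic mode: jq_args holds one opaque request line, handed to the
# agent verbatim (this line is purely illustrative, not real wget_agent
# syntax):
http://example.com/project.tar.gz

# Host-specific (MSQ) mode: jq_args holds SQL; the scheduler executes
# it and hands each resulting row to the agent:
SELECT pfile, pfile_fk FROM ...;
```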
The difference between generic-host and MSQ is critical: if an agent needs to perform a task on hundreds of database items, then it must either process the SQL query itself (using parameters from jq_args), or process one item at a time as the scheduler retrieves items via the MSQ. Since this example wc agent is expected to run on thousands of files in the repository, the host-specific MSQ option is the right choice.

With MSQ queries, we need to know the data and the stop condition. The stop condition identifies when every file has been processed. In this example, a custom table, "agent_wc", stores the results. The SQL in jq_args should return every pfile and repository file name associated with the upload that does not already exist in the agent_wc table:

```sql
SELECT pfile_sha1 || '.' || pfile_md5 || '.' || pfile_size AS pfile, pfile_fk
FROM uptreeup
WHERE upload_fk = 619
  AND pfile_fk NOT IN (SELECT agent_wc.pfile_fk FROM agent_wc)
LIMIT 5000;
```

The "LIMIT 5000" ensures that this job does not hog all of the scheduler's resources. The "619" is an example; it should match the upload_fk for the project and be set by the interface. Once every file has been processed, this query returns no rows, and that is how the scheduler knows there is no more work to perform.

Since this job should run on host-specific files, jobqueue.jq_runonpfile should be set to "pfile". This is the name of the column in the SQL results that carries the host-specific information.
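The walkthrough assumes the custom "agent_wc" table already exists. Its exact definition is not shown, but from the columns used by the agent's INSERT statement and the uniqueness constraint the agent relies on, a minimal sketch might look like this (the column types and layout are assumptions, not the real schema):

```sql
-- Hypothetical DDL for the results table.  Column types are
-- illustrative assumptions, not the actual FOSSology schema.
CREATE TABLE agent_wc (
  pfile_fk integer NOT NULL UNIQUE,  -- unique constraint makes duplicate inserts fail harmlessly
  wc_words integer,
  wc_lines integer
);
```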
Here is the complete wc_agent script:

```bash
#!/bin/bash
# Example wc agent, written in shell script.
# This should be used with engine-shell.
#
# Copyright (C) 2007 Hewlett-Packard Development Company, L.P.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# version 2 as published by the Free Software Foundation.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

# Set the path.
# If the paths in Makefile.conf change, then these will need to change.
export PATH=/usr/bin:/usr/local/fossology:/usr/local/fossology/agents:/usr/local/fossology/test.d

# This agent should appear in the scheduler.conf as:
# agent=wc | /usr/local/fossology/agents/engine-shell wc_agent '/usr/local/fossology/agents/wc_agent'

# engine-shell will convert all of the SQL columns into environment
# variables.  The MSQ will return pfile=... and pfile_fk=...
# These will become $ARG_pfile and $ARG_pfile_fk.
if [ "$ARG_pfile" == "" ] ; then
  echo "FATAL: \$ARG_pfile not set. Aborting."
  exit 1
fi
if [ "$ARG_pfile_fk" == "" ] ; then
  echo "FATAL: \$ARG_pfile_fk not set. Aborting."
  exit 1
fi

# Get the path to the actual file in the repository.
RepFile=`reppath files "$ARG_pfile"`

# Get the word-count values and insert them into the database using dbinit.
wc "$RepFile" 2>/dev/null | while read Lines Words Bytes Name ; do
  # Convert the wc output to an SQL statement.
  # The leading "!" tells dbinit to ignore insert failures.
  # There is no need to check whether the row already exists: if it did,
  # the MSQ would never have selected this pfile.  And if two agents
  # happen to run on the same data, the DB constraint for unique values
  # will prevent duplicates.
  echo "!INSERT INTO agent_wc (pfile_fk,wc_words,wc_lines) VALUES ($ARG_pfile_fk,$Words,$Lines);"
done | dbinit -

exit 0  # done successfully
```
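The heart of the agent is the wc-to-SQL pipeline. Outside of a FOSSology installation (with no reppath, dbinit, or $ARG_* environment variables available), that parsing step can be exercised on its own. In this stand-alone sketch, the scratch file path and the pfile_fk value 42 are made up for the demo:

```shell
#!/bin/bash
# Stand-alone demo of the wc-output parsing used by wc_agent.
# "42" stands in for a real $ARG_pfile_fk; /tmp/wc_demo.txt is a scratch file.
printf 'hello world\nfoo\n' > /tmp/wc_demo.txt

# wc prints: <lines> <words> <bytes> <name>; read splits on whitespace.
wc /tmp/wc_demo.txt | while read Lines Words Bytes Name ; do
  echo "!INSERT INTO agent_wc (pfile_fk,wc_words,wc_lines) VALUES (42,$Words,$Lines);"
done
# → !INSERT INTO agent_wc (pfile_fk,wc_words,wc_lines) VALUES (42,3,2);
# In the real agent, this stream is piped into "dbinit -".
```

Because `read` splits on whitespace, the variable names map directly onto wc's column order; the same one-liner scales from this two-line demo file to any repository file the MSQ hands the agent.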