It is about Scanning ...
FOSSology is a framework, toolbox and Web server application for examining software packages in a multi-user environment. A user can upload individual files or entire software packages. FOSSology will unpack this upload if necessary and run a chosen set of agents on every file of the upload. An agent can implement any analysis operation on a text file. The FOSSology package currently focuses on license-relevant data. However, it could be extended with analyses for different purposes (e.g. static code analysis).
Regular Expression Scanning for Licenses with Nomos
Nomos is one of FOSSology's license scanners. Nomos does license identification using short phrases (regular expressions) and heuristics, e.g. a phrase must be found in (or out of) proximity to another phrase or phrases. This helps to eliminate false positives.
Nomos recognizes licenses in stages: first, it uses keywords to identify license-relevant statements; then, it uses a hierarchical structure of regular expressions to identify particular licenses.
If the recognition is not complete, Nomos returns either 'UnclassifiedLicense' or a category of licenses, such as 'BSD' or 'GPL'. That is not enough to identify a license unambiguously, but it gives the user as much support as possible in determining the license situation. In this case, the found text cannot go deep enough in the hierarchy of matching phrases to determine a particular license. Note that a result such as 'BSD' or 'GPL' still requires the user to determine the exact license, such as 'BSD 3 Clause' or 'GPL 2.0'.
However, because Nomos reports a "style" type of license when a text has similarities with a known license type, it can also recognize new or unknown licenses.
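The staged recognition described above can be illustrated with a minimal sketch. The keywords, patterns and hierarchy below are hypothetical and far smaller than anything a real scanner would use; they are not taken from Nomos itself and only show how a keyword filter followed by a pattern hierarchy can fall back to a license family or to 'UnclassifiedLicense'.

```python
import re

# Stage 1 (assumed keywords): decide whether the text is license relevant at all.
KEYWORDS = re.compile(
    r"\b(?:licen[sc]e\w*|copyright|redistribut\w*|warranty)\b", re.IGNORECASE
)

# Stage 2 (assumed hierarchy): a family pattern, then more specific patterns.
HIERARCHY = {
    "GPL": (
        re.compile(r"GNU\s+General\s+Public\s+License", re.IGNORECASE),
        {
            "GPL 2.0": re.compile(r"version\s+2\b", re.IGNORECASE),
            "GPL 3.0": re.compile(r"version\s+3\b", re.IGNORECASE),
        },
    ),
    "BSD": (
        re.compile(
            r"Redistribution\s+and\s+use\s+in\s+source\s+and\s+binary\s+forms",
            re.IGNORECASE,
        ),
        {"BSD 3 Clause": re.compile(r"Neither\s+the\s+name", re.IGNORECASE)},
    ),
}


def classify(text):
    """Return a specific license, a license family, 'UnclassifiedLicense' or None."""
    if not KEYWORDS.search(text):
        return None  # no license-relevant statement found
    for family, (family_re, specifics) in HIERARCHY.items():
        if family_re.search(text):
            for name, specific_re in specifics.items():
                if specific_re.search(text):
                    return name  # deepest level of the hierarchy reached
            return family  # only the family could be determined
    return "UnclassifiedLicense"
```

A text that mentions the GNU General Public License without a version would yield 'GPL' here, which mirrors the case where the user still has to determine the exact license.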
Text-Similarity Matching with Monk
Monk is another of FOSSology's license scanners. Monk performs text-based searches and thus requires good license texts/patterns to search for. It uses the Jaccard index as a text similarity metric, combined with a weighting that ranks different matches by their size. Ranking matches by size is relevant if license texts are very similar and yield different Jaccard similarity values (e.g. different versions of the BSD license). In this case, not only the best match but also the largest match is relevant.
At upload, Monk uses the license texts stored in the FOSSology server. Using the Monk agent, the user can also define custom text phrases to identify a given license.
NOTE: Monk reports a match score against the existing licenses; however, it cannot recognize an unknown (new) license. This is why it makes sense to have two license scanners, Nomos and Monk.
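As a rough illustration of the similarity metric, the following sketch computes the Jaccard index over word sets and ranks reference texts by score, using the size of the reference text as a secondary criterion. The tokenization and the way size enters the ranking are assumptions made for this example; the text above does not specify how Monk tokenizes or weights matches.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words (an assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def rank(candidate, references):
    """Rank reference names by (Jaccard score, reference size), best first."""
    scored = [
        (jaccard(candidate, ref), len(tokens(ref)), name)
        for name, ref in references.items()
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, _size, name in scored]
```

A candidate that has no close counterpart among the references simply receives low scores for all of them, which matches the note that a purely text-based matcher cannot recognize an unknown license.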
FOSSology searches for copyrights based on the text fragments 'copyright' and '(C)', and tries to filter out many of the false positive matches. The results are shown in a (usually large) table, where a user can review and correct copyright statements. In addition, FOSSology finds other potentially copyright-relevant statements, such as ‘authored by’, ‘written by’, etc.
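A minimal sketch of such a search, assuming a single regular expression over the text fragments named above, could look as follows. The pattern is hypothetical and performs no false-positive filtering, so it only illustrates the first step of finding candidate statements.

```python
import re

# Assumed pattern: a trigger phrase, then the rest of the line as the statement.
STATEMENT_RE = re.compile(
    r"(?:copyright|\(c\)|authored\s+by|written\s+by)[^\n]*", re.IGNORECASE
)


def find_statements(text):
    """Return candidate copyright and authorship statements, one per match."""
    return [match.group(0).strip() for match in STATEMENT_RE.finditer(text)]
```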
Export Control Codes (ECC)
As another agent, the keyword agent allows a file-based definition of custom regular expressions that FOSSology should look for. Such keywords do not necessarily refer to licensing issues, but can cover all kinds of searches. As an example of the file-based keyword search, FOSSology searches for selected terms that may hint at ECC (export control and customs) relevant statements. The FOSSology interface lets the user review these findings.
The user can define so-called buckets: FOSSology puts files that match user-defined conditions regarding their license situation into particular buckets. If the user searches for particular license occurrences, FOSSology presents them in consolidated listings, convenient for further processing.
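The bucket idea can be sketched as a list of named conditions that are evaluated against the licenses found in each file. The bucket names, the rules and the "needs review" fallback below are hypothetical; they do not reproduce FOSSology's actual bucket definitions.

```python
# Assumed license groupings, for illustration only.
COPYLEFT_PREFIXES = ("GPL", "LGPL", "AGPL")
PERMISSIVE = {"MIT", "BSD 3 Clause", "Apache 2.0"}


def is_copyleft(licenses):
    return any(name.startswith(COPYLEFT_PREFIXES) for name in licenses)


def is_permissive_only(licenses):
    return bool(licenses) and licenses <= PERMISSIVE


def has_no_license(licenses):
    return not licenses


# Hypothetical user-defined buckets: (bucket name, condition on the license set).
BUCKET_RULES = [
    ("copyleft", is_copyleft),
    ("permissive only", is_permissive_only),
    ("no license found", has_no_license),
]


def assign_buckets(file_licenses):
    """Map each bucket name to the sorted list of files matching its condition."""
    buckets = {name: [] for name, _rule in BUCKET_RULES}
    buckets["needs review"] = []
    for path in sorted(file_licenses):
        licenses = set(file_licenses[path])
        matched = [name for name, rule in BUCKET_RULES if rule(licenses)]
        for name in matched or ["needs review"]:
            buckets[name].append(path)
    return buckets
```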
... and Reviewing!
One of the key features of FOSSology is the user interface for reviewing license findings in order to determine the exact licensing of a file. The FOSSology UI offers beautiful text highlighting and super fast page loading even with large software packages, such as the Boost library or the Linux kernel.
License Text Management
FOSSology maintains licenses and their license texts in its database. These licenses can be reviewed and managed with additional metadata. In the Web-based user interface, the user can also add new or custom licenses, either individually or by uploading a CSV (comma-separated values) file. The user can also override the given license text with the license text actually found in a file, allowing for the most precise reporting possible on found license texts.
Mark License as Global License
When reviewing license findings, the user may recognize that one or more licenses actually represent the main license of a software package (a.k.a. global license). A user can capture the main license setting by marking the license occurrence in the single files view (using the ‘star’ icon).
If an upload contains a new license text or a recurring file notice referring to a license, the clearing work can become tedious: the user needs to review every one of the unknown occurrences. Browsing through an upload file by file to review an unknown license finding can turn into an almost endless task for larger packages, such as the Linux kernel, the OpenOffice source code or the Boost library.
However, in many projects, the same file notice text appears across many files. Given a high number of files with exactly the same license text, processing them in bulk is the obvious approach.
With bulk recognition, the user defines a text phrase and associates a license entry with all of the matched files, setting the license situation for thousands of files with a few clicks.
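The following sketch shows the idea of bulk recognition under simple assumptions: files are given as a mapping from path to text, and a phrase matches if its words occur consecutively in the file, ignoring case, punctuation and line breaks. This normalization is an assumption for the example and not a description of how FOSSology matches phrases.

```python
import re


def normalize(text):
    """Lowercase the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def bulk_assign(files, phrase, license_name):
    """Return {path: license_name} for every file whose text contains the phrase."""
    needle = normalize(phrase)
    if not needle:
        return {}
    decisions = {}
    for path, content in files.items():
        # Pad with spaces so that only whole words can match.
        if f" {needle} " in f" {normalize(content)} ":
            decisions[path] = license_name
    return decisions
```

Because comment markers and line breaks are dropped before matching, the same notice wrapped differently across files is still found with a single phrase.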
Aggregated File View
Reviewing large uploads for the file-by-file license situation requires an efficient way of browsing the file structure. FOSSology offers aggregated views both by license occurrence and by folder hierarchy. Using this aggregation, the user can drill down to unresolved corners of the upload, get an overview of the overall license situation or find areas that require an additional review.
Reuse of License Reviews
Last but not least, if the license situation stays the same for a file (based on the file hash), FOSSology reuses scan results as well as license review decisions. During the lifetime of a software project, typically a series of versions of the same components is used. Because not every file changes between versions, the license review work can be drastically reduced by copying review decisions for files sharing the same hash value.
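Hash-based reuse can be sketched as a lookup from content hash to an earlier review decision. The use of SHA-256 and the data layout below are assumptions for illustration; the text above only states that reuse is based on the file hash.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """Content hash of a file (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(content).hexdigest()


def reuse_decisions(old_files, old_decisions, new_files):
    """Copy decisions to new files with unchanged content.

    Returns (reused, pending): decisions carried over by hash, and the sorted
    paths of files that still need a review.
    """
    by_hash = {
        file_hash(old_files[path]): decision
        for path, decision in old_decisions.items()
        if path in old_files
    }
    reused, pending = {}, []
    for path, content in new_files.items():
        digest = file_hash(content)
        if digest in by_hash:
            reused[path] = by_hash[digest]
        else:
            pending.append(path)
    return reused, sorted(pending)
```

Since the lookup uses the content hash rather than the path, a file that was moved or copied unchanged in a new version still inherits its earlier decision.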
Tell the World
Providing the complete picture, FOSSology lets you export the found information in various ways and formats. Users can generate:
- Readme files for the distribution containing all identified license texts and copyright information.
- Lists of files in a hierarchical structure with found licenses identified by their short name identifier.
- SPDX 2.0 exports in the tag-value and RDF/XML formats.
- Debian-copyright (a.k.a. DEP5) files.
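To give an impression of the tag-value style mentioned in the list above, the sketch below renders the per-file part of an SPDX 2.0 document. It is a simplified, unvalidated fragment written for illustration: the function and its parameters are hypothetical, the document and package sections that a complete SPDX file requires are left out, and the output is not claimed to match what FOSSology generates.

```python
def spdx_file_section(name, spdx_id, sha1, concluded, found, copyright_text):
    """Render one simplified SPDX 2.0 tag-value file section as a string."""
    lines = [
        f"FileName: {name}",
        f"SPDXID: SPDXRef-{spdx_id}",
        f"FileChecksum: SHA1: {sha1}",
        f"LicenseConcluded: {concluded}",
    ]
    # One LicenseInfoInFile line per license found by the scanners.
    lines += [f"LicenseInfoInFile: {license_id}" for license_id in found]
    lines.append(f"FileCopyrightText: <text>{copyright_text}</text>")
    return "\n".join(lines)
```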
There is a lot more in FOSSology, such as handling of different teams working on the same server. Most functionality can be called from the command line, so FOSSology can be integrated into automated workflows. FOSSology creates statistics on your uploads and can organize your uploads in folders with metadata, so you can organize the review work between your project teams.