What is deep scan?
Deep scan is one of fossid-da’s scan modes. It collects additional information about dependencies.
This mode collects copyright and compliance information from all files in every dependency package.
What information is normally returned by FossID DA about a component?
When running a normal scan, this is the information gathered for one component:
{
"name": "word-wrap",
"version": "1.2.3",
"spdx": {
"purl": "pkg:npm/word-wrap@1.2.3",
"supplier_name": "jonschlinkert",
"supplier_url": "https://github.com/jonschlinkert"
},
"fda_info": {
"bin_url": "",
"homepage": "https://github.com/jonschlinkert/word-wrap",
"id": "NPM::word-wrap:1.2.3",
"license": "MIT",
"source_url": "https://registry.npmjs.org/word-wrap/-/word-wrap-1.2.3.tgz",
"vcs_type": "git",
"vcs_revision": "1.2.3",
"vcs_url": "https://github.com/jonschlinkert/word-wrap",
"additional_info": {}
}
}
where:
- name - Component name
- version - Component version
- spdx - SPDX info
- fda_info - FDA collected info about the component
How is this information collected in the normal (non-deep-scan) case?
This information is collected from a single source for each dependency type. This source is usually the software registry of that dependency type.
For example:
- An NPM component has its software registry at npmjs.org. This is the source from which fossid-da gets the component info.
What is different if you enable deep scan?
The major differences between normal scanning and deep scanning are:
- The number of license sources is increased. This varies depending on the dependency type and on the info available in the software registry of the component.
- The compliance info is more detailed. It includes all the license and copyright info detected in the actual source files of the dependencies.
- Compatibility information is available in the results.
- The scan time is increased due to data gathering and processing of the results.
How do you activate it?
This mode can be activated in two ways:
- In fossid.conf, add the following option line: da_deep_scan=1, then run a Workbench dependency scan.
- When running the fossid-da CLI, add the following option: --deep-scan (see FossID-DA-CLI-Guide for more info about the fossid-da CLI).
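The first activation route amounts to making sure one key=value line is present in fossid.conf. The following is a minimal sketch of that edit, not an official tool: the helper name set_conf_option is hypothetical, and it assumes fossid.conf consists of plain key=value lines with "#" marking comments.

```python
def set_conf_option(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with `key=value` set, replacing an existing line or appending one.

    Assumption: the file is a flat list of key=value lines and '#' starts a comment.
    """
    lines = conf_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue
        if stripped.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"  # overwrite the existing setting
            break
    else:
        lines.append(f"{key}={value}")  # key not found: append it
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(set_conf_option("da_deep_scan=0\n", "da_deep_scan", "1"), end="")
```

Working on the text rather than the file keeps the sketch free of assumptions about where fossid.conf is located.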
What are the deep scan limitations?
Deep scanning does not work with components that do not have a declared version (N/A).
This affects:
- C/C++ import scans
- Go import scans
- Python import scans
The scans will run as normal, but additional information won’t be gathered.
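When post-processing results, it can help to know in advance which components will have no deep scan data. A minimal sketch, assuming a missing version shows up in the component object as the string "N/A" or as an empty or absent "version" field (the function name can_deep_scan is hypothetical):

```python
def can_deep_scan(component: dict) -> bool:
    """Return True if the component declares a version, so deep scan can gather data for it.

    Assumption: an undeclared version is represented as "N/A", "" or a missing field.
    """
    version = str(component.get("version") or "").strip()
    return version not in ("", "N/A")


if __name__ == "__main__":
    print(can_deep_scan({"name": "word-wrap", "version": "1.2.3"}))
    print(can_deep_scan({"name": "some-import", "version": "N/A"}))
```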
What does it do?
It gathers compliance info from multiple sources and establishes an overall license.
After the dependency tree is done, fossid-da will:
- Download the package (archive)
- Scan the package (archive) with fossid-cli and get a first set of results
- Extract the package (archive)
- Scan all extracted files with license-extractor (shinobi)
- Generate another set of results based on the findings from the LE scan:
  - compliance info
  - compatibility info
- Based on the dependency type, get relevant API license data
- If there is any GitHub VCS info, get license data from GitHub
- Combine and analyze all results and set an overall license
- Remove all downloaded and extracted data
This process improves the overview from a license compliance point of view for dependency packages.
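The extract, scan, and clean-up portion of this workflow can be illustrated with a small sketch. It is not fossid-da's implementation: process_package is a hypothetical name, a zip archive stands in for the downloaded package, and the "scan" is a placeholder that only lists the extracted files instead of calling fossid-cli or license-extractor. What it does show is how the final step (removing all extracted data) can be guaranteed even when an earlier step fails.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def process_package(archive: Path, work_root: Path) -> list[str]:
    """Extract `archive` under work_root, list its files, then remove the working data."""
    work_dir = Path(tempfile.mkdtemp(prefix="pkg-", dir=work_root))
    try:
        extracted = work_dir / "extracted"
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(extracted)
        # Placeholder for the real scan step: collect the relative file paths.
        return sorted(
            p.relative_to(extracted).as_posix()
            for p in extracted.rglob("*")
            if p.is_file()
        )
    finally:
        # Mirrors the last step of the list: remove everything that was extracted.
        shutil.rmtree(work_dir, ignore_errors=True)
```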
Where does it save and extract dependency packages?
By default, this is done in /tmp/fossid-da (fossid-da needs write permissions for the /tmp folder).
This path can be changed in fossid.conf by replacing the value in the following line with the desired path:
da_download_path="/tmp/fossid-da"
NOTE: It is recommended to keep the default download path and grant write permissions on the /tmp folder instead.
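Before starting a deep scan, the write-permission requirement can be checked with a few lines of Python. This is a sketch under stated assumptions: can_write_download_path is a hypothetical helper, it must be run as the same user that runs fossid-da, and it checks the parent folder when the download folder itself does not exist yet.

```python
import os
from pathlib import Path


def can_write_download_path(path: str) -> bool:
    """Return True if the download path (or its parent, if it does not exist yet) is writable."""
    p = Path(path)
    target = p if p.exists() else p.parent
    # Creating files in a directory needs both write and search (execute) permission.
    return target.is_dir() and os.access(target, os.W_OK | os.X_OK)


if __name__ == "__main__":
    print(can_write_download_path("/tmp/fossid-da"))
```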
What type of reports will this mode return?
- Deep scan will generate the analyzer-result.json report as expected, but will add another field, fda_deep_scan, to each package in the packages list of the analyzer-result.json report. The fda_deep_scan section will contain all additional info collected by FDA.
...
"packages": [
  {
    "package": {
      "id": "PyPI::pytest:8.3.2",
      "purl": "pkg:pypi/pytest@8.3.2",
      "declared_licenses": [
        "MIT"
      ],
      ...
      "fda_deep_scan": {
        "name": "pytest",
        "version": "8.3.2",
        "spdx": {
          ...
        },
        "fda_info": {
          ...
          "homepage": "https://github.com/pytest-dev/pytest",
          "id": "PyPI::pytest:8.3.2",
          "license": "MIT",
          "vcs_type": "git",
          "vcs_revision": "8.3.2",
          "vcs_url": "https://github.com/pytest-dev/pytest",
          "additional_info": {}
          ...
        ...
      }
      ...
- An additional custom report will be generated in the same location as analyzer-result.json. The custom report is named using the following form: fda{SCAN_CODE}.json_.
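To pull the deep scan sections out of analyzer-result.json, something like the following could be used. It is a sketch, not part of FossID: the function names are hypothetical, and because the excerpt above elides closing braces, the code assumes fda_deep_scan sits inside each "package" object and falls back to the enclosing list entry if it does not.

```python
import json


def deep_scan_sections(report: dict) -> dict:
    """Map package id -> fda_deep_scan section for every package that has one."""
    sections = {}
    for entry in report.get("packages", []):
        package = entry.get("package", {})
        # Assumption: the field lives inside "package"; otherwise try the entry itself.
        deep = package.get("fda_deep_scan", entry.get("fda_deep_scan"))
        if deep is not None:
            sections[package.get("id")] = deep
    return sections


def load_deep_scan_sections(path: str) -> dict:
    """Read an analyzer-result.json file and return its deep scan sections."""
    with open(path, encoding="utf-8") as fh:
        return deep_scan_sections(json.load(fh))
```

Packages without a deep scan section (for example, those without a declared version) are simply absent from the returned mapping.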
What is the structure of FDA’s deep scan info?
This is an example of one dependency object from the report:
[
...
{
"name": "pytest-cov",
"version": "1.8.1",
"spdx": {
"purl": "pkg:pypi/pytest-cov@1.8.1",
"supplier_name": "schlamar",
"supplier_url": "https://github.com/schlamar"
},
"fda_info": {
"bin_url": "",
"homepage": "https://github.com/pytest-dev/pytest-cov",
"id": "PyPI::pytest-cov:1.8.1",
"license": "MIT",
"source_url": "https://files.pythonhosted.org/packages/11/4b/b04646e97f1721878eb21e9f779102d84dd044d324382263b1770a3e4838/pytest-cov-1.8.1.tar.gz",
"vcs_type": "git",
"vcs_revision": "1.8.1",
"vcs_url": "https://github.com/schlamar/pytest-cov",
"additional_info": {}
},
"fda_license_info": {
"overall_license": "MIT",
"confidence": "89.87%",
"license_data": [
{
"clearly_defined": {
"timestamp": "2024-02-05",
"rule": "https://api.clearlydefined.io/harvest/pypi/pypi/-/pytest-cov/1.8.1",
"license": "BSD-2-Clause"
}
},
{
"clearly_defined": {
"timestamp": "2024-02-05",
"rule": "https://api.clearlydefined.io/harvest/pypi/pypi/-/pytest-cov/1.8.1",
"license": "MIT"
}
},
{
"github_API": {
"timestamp": "2024-02-05",
"rule": "https://github.com/pytest-dev/pytest-cov/tree/v1.8.1",
"license": "MIT"
}
},
{
"github_API": {
"timestamp": "2024-02-05",
"rule": "https://github.com/pytest-dev/pytest-cov/tree/v1.8.1",
"license": "MIT"
}
},
{
"PyPI_API": {
"timestamp": "2024-02-05",
"rule": "https://pypi.org/project/pytest-cov/",
"license": "MIT"
}
},
{
"libraries_io_API": {
"timestamp": "2024-02-05",
"rule": "https://libraries.io/pypi/pytest-cov/1.8.1",
"license": "MIT"
}
},
{
"cli_info": {
"timestamp": "2024-02-05",
"cli_version": "3.4.14",
"license": "MIT",
"rule": "https://files.pythonhosted.org/packages/11/4b/b04646e97f1721878eb21e9f779102d84dd044d324382263b1770a3e4838/pytest-cov-1.8.1.tar.gz"
}
},
{
"cli_info": {
"timestamp": "2024-02-05",
"cli_version": "3.4.14",
"license": "MIT",
"rule": "https://pypi.python.org/packages/11/4b/b04646e97f1721878eb21e9f779102d84dd044d324382263b1770a3e4838/pytest-cov-1.8.1.tar.gz"
}
},
{
"download_info": {
"timestamp": "2024-02-05",
"download_url": "https://files.pythonhosted.org/packages/11/4b/b04646e97f1721878eb21e9f779102d84dd044d324382263b1770a3e4838/pytest-cov-1.8.1.tar.gz",
"overall_license": "MIT",
"rule": "LICENSE.txt,setup.py",
"le_version": "1.2.0",
"le_underlying_licenses": {
"MIT": 3
},
"le_data": {
"1": {
"MIT": [
"setup.py",
{
"LICENSE.txt": [
"(c) 2010 Meme Dough"
]
}
]
},
"2": {
"MIT": [
"pytest_cov.egg-info/PKG-INFO"
]
}
}
}
}
],
"license_rule": "license_data",
"vulnerability_data": "",
"license_category": "PERMISSIVE"
},
"compliance_info": {
"company_copyrights": {},
"detected_copyleft": {},
"detected_agpl": {},
"non_spdx": {},
"company_emails": {},
"compat_issues": {
"file_vs_ov": {
"code": -1,
"desc": "self"
},
"ov_vs_ov": {
"code": -1,
"desc": "self"
},
"prj_vs_ov": {
"code": 3,
"desc": "n/a"
},
"sub_vs_ov": {
"code": 3,
"desc": "n/a"
}
}
},
"file_types": {
"total_files": 15,
"file_types": {
"PKG-INFO": 2,
"py": 3,
"in": 1,
"cfg": 1,
"rst": 1,
"txt": 6,
"not-zip-safe": 1
},
"dependency_type": "PyPI"
}
},
...
]
where:
- name - Component name
- version - Component version
- spdx - SPDX info
- fda_info - FDA collected info about the component
- fda_license_info - FDA collected info about licenses:
  - overall_license - Established overall license based on the lists of licenses and sources
  - confidence - Established confidence level based on the list of license sources
  - license_data - List of license identifiers and license sources:
    - clearly_defined - license information gathered from clearlydefined.io
    - github_API - license information gathered from github.com
    - PyPI_API - license information gathered from pypi.org
    - libraries_io_API - license information gathered from libraries.io
    - crates_API - license information gathered from crates.io
    - Maven_API - license information gathered from mvnrepository.com
    - NPM_API - license information gathered from npmjs.com
    - conan_API - license information gathered from conan.io/center
    - Hex_API - license information gathered from hex.pm
    - packagist_API - license information gathered from packagist.org
    - NuGet_API - license information gathered from nuget.org
    - GoMod - license information gathered from pkg.go.dev
    - CocoaPod_API - license information gathered from cocoapods.org
    - Gem_API - license information gathered from rubygems.org
    - Pub_API - license information gathered from pub.dev
    - Haskell_API - license information gathered from hackage.haskell.org
    - cli_info - license information gathered from FossID KB, using fossid-cli
    - download_info - License and copyright info detected in the extracted dependency:
      - timestamp - Detection timestamp.
      - le_version - License extractor version.
      - download_url - Download URL of component.
      - overall_license - Established overall license based on the license information from License Extractor.
      - rule - File path(s) of the license sources from where the overall license was selected.
      - le_underlying_licenses - Dictionary containing the license IDs and number of occurrences.
      - le_data - License extractor data.
  - license_rule - How the license was picked. This is not available at this moment.
  - vulnerability_data - Detected vulnerability (CPE).
  - license_category - License category that mimics Workbench’s license category matrix.
- compliance_info - FDA collected compliance info:
  - company_copyrights - Detected company copyrights
  - detected_copyleft - Detected copyleft licenses
  - detected_agpl - Detected AGPL licenses
  - non_spdx - Detected Non-SPDX licenses
  - company_emails - Detected company emails in copyrights
  - compat_issues - Compatibility issues (uses compatibility matrix):
    - file_vs_ov - File licenses vs. Component license
    - ov_vs_ov - Component licenses vs. Component license (this is used for multi-licenses)
    - prj_vs_ov - Project license vs. Component license
    - sub_vs_ov - Transitive dependency vs. Component license
- file_types - Detected file type information
- dependency_type - Dependency type information
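Each license_data entry is an object with a single key (the source name), so the per-source licenses can be listed and counted with a short helper. The sketch below only tallies what the sources report. It does not reproduce how FDA establishes overall_license or confidence, which the source does not describe. The function names are hypothetical, and the code relies on download_info carrying overall_license instead of license, as in the example above.

```python
from collections import Counter


def licenses_by_source(fda_license_info: dict) -> list:
    """Return (source, license) pairs, one per entry in license_data."""
    pairs = []
    for entry in fda_license_info.get("license_data", []):
        for source, details in entry.items():
            # download_info uses "overall_license"; the other sources use "license".
            pairs.append((source, details.get("license") or details.get("overall_license")))
    return pairs


def license_counts(fda_license_info: dict) -> Counter:
    """Count how many license_data entries report each license identifier."""
    return Counter(lic for _, lic in licenses_by_source(fda_license_info) if lic)
```

For the pytest-cov example, such a tally makes it easy to spot that one clearlydefined.io entry reports BSD-2-Clause while the remaining entries report MIT.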
What is the structure of FDA’s custom report?
The custom report is structured as a list of JSON objects (dictionaries).
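Because the report is a plain list, a per-component summary takes only a few lines once the file has been parsed with json.load. A minimal sketch, assuming each object follows the deep scan structure shown earlier and that confidence is a percentage string such as "89.87%" (summarize is a hypothetical name):

```python
def summarize(report: list) -> list:
    """Return one summary row (name, version, overall license, confidence) per component."""
    rows = []
    for component in report:
        info = component.get("fda_license_info", {})
        confidence = info.get("confidence", "")
        rows.append({
            "name": component.get("name"),
            "version": component.get("version"),
            "overall_license": info.get("overall_license"),
            # "89.87%" -> 89.87; None when no confidence is present.
            "confidence": float(confidence.rstrip("%")) if confidence else None,
        })
    return rows
```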
What are the license sources from where fossid-da gets its information?
- General query: libraries.io
- General query: clearlydefined.io
- General query: fossid-cli/FossID KB
- CocoaPods: cocoapods.org
- CPP: conan.io/center - The Conan libraries and tools central repository
- Cargo/Crates: crates.io: Rust Package Registry
- Debian: debian.org
- Fuget: fuget.org
- Nuget: nuget.org - NuGet Gallery
- Github: https://raw.githubusercontent.com; github.com - GitHub: Let’s build from here
- Go: pkg.go.dev - Go Packages
- GoogleMaven: maven.google.com - Google’s Maven Repository
- GradlePlugins: plugins.gradle.org - Gradle - Plugins
- Haskell: hackage.haskell.org - The Haskell Package Repository
- Elixir/Hex: hex.pm - The package manager for the Erlang ecosystem
- LibrariesIo: libraries.io
- Maven: mvnrepository.com - Maven Central Repository
- NPM: npmjs.com; https://registry.npmjs.org
- PHP/Composer: packagist.org - The PHP Package Repository
- Pub: pub.dev - The official repository for Dart and Flutter packages
- PyPI: pypi.org - PyPI · The Python Package Index
- Ruby: rubygems.org - Ruby community’s gem hosting service
How does this affect the performance of the dependency scan?
Collecting all these types of information from so many different sources increases the duration of a dependency scan.
What other options does deep scan have?
Currently, these are the available deep scan options:
- da_deep_scan=0 or 1 - Activates the actual scan.
- da_download_path="/tmp/fossid-da" - Download path.
- da_allow_double_api_calls=0 or 1 - Allows multiple queries to the same source. This is used to get additional information from a potential source. This can affect the number of GitHub queries.
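To see which deep scan settings are in effect, the three options can be read from the configuration text. This is a sketch only: read_deep_scan_options is a hypothetical helper, fossid.conf is assumed to hold plain key=value lines with "#" comments, and the fallback values for da_deep_scan and da_allow_double_api_calls are assumptions (only the /tmp/fossid-da default is stated in this document).

```python
# Fallbacks used when an option is absent from the file.
# Assumption: "0" for the two switches; the download path default comes from this document.
DEFAULTS = {
    "da_deep_scan": "0",
    "da_download_path": "/tmp/fossid-da",
    "da_allow_double_api_calls": "0",
}


def read_deep_scan_options(conf_text: str) -> dict:
    """Return the deep scan options found in conf_text, falling back to DEFAULTS."""
    options = dict(DEFAULTS)
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in options:
            options[key] = value.strip().strip('"')  # drop optional surrounding quotes
    return options
```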