FossID Documentation

FossID-DA Deep Scan

What is deep scan?

Deep scan is one of fossid-da’s scan modes. It collects additional information about dependencies.

This mode collects copyright and compliance information from all files in every dependency package.

What information is normally returned by FossID DA about a component?

When running a normal scan, this is the information gathered for one component:

{
    "name": "word-wrap",
    "version": "1.2.3",
    "spdx": {
        "purl": "pkg:npm/word-wrap@1.2.3",
        "supplier_name": "jonschlinkert",
        "supplier_url": "https://github.com/jonschlinkert"
    },
    "fda_info": {
        "bin_url": "",
        "homepage": "https://github.com/jonschlinkert/word-wrap",
        "id": "NPM::word-wrap:1.2.3",
        "license": "MIT",
        "source_url": "https://registry.npmjs.org/word-wrap/-/word-wrap-1.2.3.tgz",
        "vcs_type": "git",
        "vcs_revision": "1.2.3",
        "vcs_url": "https://github.com/jonschlinkert/word-wrap",
        "additional_info": {}
    }   
}

where:

  • name - Component name
  • version - Component version
  • spdx - SPDX info
  • fda_info - FDA collected info about the component
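As a minimal sketch, the component object above can be read with Python's standard json module. The helper name summarize_component is hypothetical and is not part of fossid-da; only the field names are taken from the example.

```python
import json

# Trimmed copy of the component object shown above.
COMPONENT_JSON = """
{
    "name": "word-wrap",
    "version": "1.2.3",
    "spdx": {"purl": "pkg:npm/word-wrap@1.2.3"},
    "fda_info": {
        "id": "NPM::word-wrap:1.2.3",
        "license": "MIT",
        "source_url": "https://registry.npmjs.org/word-wrap/-/word-wrap-1.2.3.tgz"
    }
}
"""


def summarize_component(component: dict) -> str:
    """Return a one-line summary (hypothetical helper, not a fossid-da API)."""
    fda_info = component.get("fda_info", {})
    spdx = component.get("spdx", {})
    return "{} {} [{}] {}".format(
        component["name"],
        component["version"],
        fda_info.get("license", "unknown"),
        spdx.get("purl", ""),
    )


component = json.loads(COMPONENT_JSON)
print(summarize_component(component))
```

The .get() calls keep the helper tolerant of components where the spdx or fda_info sections are missing.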

How is this information collected in the normal (non-deep-scan) case?

This information is collected from a single source for each dependency type. This source is usually the software registry of that dependency type.

For example:

  • For an NPM component, the software registry is npmjs.org. This is the source from which fossid-da gets the component info.

What is different if you enable deep scan?

The major differences between normal scanning and deep scanning are:

  1. The number of license sources is increased. This varies depending on the dependency type and on the info available in the software registry of the component.
  2. The compliance info is more detailed. It includes all the license info and copyright info detected in the actual source files of the dependencies.
  3. Compatibility information is available in the results.
  4. The scan time is increased due to data gathering and processing of the results.

How do you activate it?

This mode can be activated in two ways:

  1. In fossid.conf, add the option line da_deep_scan=1, then run a Workbench dependency scan.

  2. When running the fossid-da CLI, add the option --deep-scan (see FossID-DA-CLI-Guide for more info about the fossid-da CLI).
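For the first activation method, the following is a minimal sketch of switching the option on in the text of a configuration file. It assumes fossid.conf uses plain key=value lines, as the da_deep_scan=1 option line suggests. The helper set_conf_option is hypothetical and works on a string, so it never touches a real file.

```python
import re


def set_conf_option(conf_text: str, key: str, value: str) -> str:
    """Set key=value in conf_text, replacing an existing line or appending one."""
    # Match an uncommented "key = ..." line; [ \t]* avoids spanning blank lines.
    pattern = re.compile(r"^[ \t]*" + re.escape(key) + r"[ \t]*=.*$", re.MULTILINE)
    new_line = f"{key}={value}"
    if pattern.search(conf_text):
        # A function replacement keeps backslashes in the value literal.
        return pattern.sub(lambda _match: new_line, conf_text, count=1)
    separator = "" if conf_text.endswith("\n") or not conf_text else "\n"
    return conf_text + separator + new_line + "\n"


example = 'da_deep_scan=0\nda_download_path="/tmp/fossid-da"\n'
updated = set_conf_option(example, "da_deep_scan", "1")
print(updated)
```

Back up the real fossid.conf before writing any modified text to it.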

What are the deep scan limitations?

Deep scanning does not work with components that do not have a declared version (N/A).

This affects:

  • C/C++ import scans
  • Go import scans
  • Python import scans

The scans will run as normal, but additional information won’t be gathered.
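To see in advance which components fall under this limitation, a component list can be split by declared version. This is a sketch only: it assumes a missing version appears as the string "N/A" (or as an empty value) in the version field, and split_by_declared_version is a hypothetical helper.

```python
def split_by_declared_version(components):
    """Split components into (deep-scannable, skipped) by declared version."""
    scannable, skipped = [], []
    for component in components:
        version = (component.get("version") or "").strip()
        # Assumption: an undeclared version is empty or the literal "N/A".
        if version and version.upper() != "N/A":
            scannable.append(component)
        else:
            skipped.append(component)
    return scannable, skipped


components = [
    {"name": "word-wrap", "version": "1.2.3"},
    {"name": "some-go-import", "version": "N/A"},
]
scannable, skipped = split_by_declared_version(components)
print([c["name"] for c in scannable], [c["name"] for c in skipped])
```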

What does it do?

It gathers compliance info from multiple sources and establishes an overall license.

After the dependency tree has been built, fossid-da will:

  1. Download the package (archive)
  2. Scan the package (archive) with fossid-cli and get a first set of results
  3. Extract the package (archive)
  4. Scan all extracted files with license-extractor (shinobi)
  5. Generate another set of results based on the findings from the LE scan:
    • compliance info
    • compatibility info
  6. Based on the dependency type, it will get relevant API license data
  7. If there is any GitHub VCS info, get license data from GitHub
  8. Combine and analyze all results and set an overall license
  9. Remove all downloaded and extracted data

This process improves the overview from a license compliance point of view for dependency packages.
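The extract and clean-up steps above (3 and 9) can be illustrated with the Python standard library. This is a sketch under stated assumptions, not fossid-da's implementation: the download is replaced by a small local .tar.gz built for the demo, no scanning is done, and list_archive_files is a hypothetical helper.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def list_archive_files(archive_path: str, work_root: str) -> list:
    """Extract a .tar.gz archive, list its files, then remove the extracted data."""
    extract_dir = Path(tempfile.mkdtemp(prefix="pkg-", dir=work_root))
    try:
        with tarfile.open(archive_path, "r:gz") as archive:
            # Use the safer "data" extraction filter where Python provides it.
            if hasattr(tarfile, "data_filter"):
                archive.extractall(extract_dir, filter="data")
            else:
                archive.extractall(extract_dir)
        return sorted(
            path.relative_to(extract_dir).as_posix()
            for path in extract_dir.rglob("*")
            if path.is_file()
        )
    finally:
        # Mirrors step 9: nothing extracted is left behind.
        shutil.rmtree(extract_dir, ignore_errors=True)


# Demo: build a tiny archive locally instead of downloading a real package.
with tempfile.TemporaryDirectory() as work_root:
    src = Path(work_root) / "demo-1.0"
    src.mkdir()
    (src / "LICENSE.txt").write_text("MIT License\n")
    (src / "setup.py").write_text("# placeholder\n")
    archive_path = Path(work_root) / "demo-1.0.tar.gz"
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(src, arcname="demo-1.0")
    files = list_archive_files(str(archive_path), work_root)
    print(files)
```

Doing the clean-up in a finally block means the extracted data is removed even if extraction fails part-way.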

Where does it save and extract dependency packages?

By default, packages are downloaded and extracted in /tmp/fossid-da (fossid-da needs write permissions for the /tmp folder).

This path can be changed in fossid.conf by changing the following line:

da_download_path="/tmp/fossid-da" 

with the desired path.

NOTE: It is recommended to keep the default download path and to grant fossid-da write permissions to the /tmp folder.
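Before a scan, the write-permission requirement can be checked with a few lines of Python. This is a sketch: can_use_download_path is a hypothetical helper, and it assumes that a download directory which does not exist yet would be created inside its parent folder.

```python
import os
import tempfile


def can_use_download_path(path: str) -> bool:
    """Return True if path (or its parent, when path does not exist) is writable."""
    if os.path.isdir(path):
        target = path
    else:
        # Assumption: a missing directory would be created inside its parent.
        target = os.path.dirname(os.path.abspath(path))
    return os.path.isdir(target) and os.access(target, os.W_OK | os.X_OK)


# Check a path shaped like the default, relative to the system temp folder.
candidate = os.path.join(tempfile.gettempdir(), "fossid-da")
print(candidate, can_use_download_path(candidate))
```

Run the check as the same user that runs fossid-da, since os.access evaluates permissions for the current process.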

What type of reports will this mode return?

  1. Deep scan generates the analyzer-result.json report as usual, but adds another field, fda_deep_scan, to each package in the packages list of that report. The fda_deep_scan section contains all additional info collected by FDA.

    ...
     "packages": [
       {
         "package": {
             "id": "PyPI::pytest:8.3.2",
             "purl": "pkg:pypi/pytest@8.3.2",
             "declared_licenses": [
                 "MIT"
             ],
             ...
             "fda_deep_scan": {
                 "name": "pytest",
                 "version": "8.3.2",
                 "spdx": {
                   ...
                 },
                 "fda_info": {
                     ...
                     "homepage": "https://github.com/pytest-dev/pytest",
                     "id": "PyPI::pytest:8.3.2",
                     "license": "MIT",
                     "vcs_type": "git",
                     "vcs_revision": "8.3.2",
                     "vcs_url": "https://github.com/pytest-dev/pytest",
                     "additional_info": {}
                     ...
             ...
         }
     ...
    
  2. An additional custom report is generated in the same location as analyzer-result.json. The custom report is named according to the following pattern: fda{SCAN_CODE}.json_.
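The fda_deep_scan field described in item 1 can be collected from a packages list as sketched below. The excerpt elides the structure around "packages", so the top-level placement used in this sample is an assumption; adjust the lookup to wherever the list sits in your report. deep_scan_entries is a hypothetical helper.

```python
def deep_scan_entries(packages):
    """Yield (package id, fda_deep_scan dict) for packages that carry deep scan data."""
    for entry in packages:
        package = entry.get("package", {})
        deep = package.get("fda_deep_scan")
        if deep is not None:
            yield package.get("id"), deep


# Assumed placement of "packages"; only the inner fields come from the excerpt.
report = {
    "packages": [
        {
            "package": {
                "id": "PyPI::pytest:8.3.2",
                "declared_licenses": ["MIT"],
                "fda_deep_scan": {"name": "pytest", "version": "8.3.2"},
            }
        },
        {"package": {"id": "PyPI::example:0.0.1"}},  # no deep scan data
    ]
}

for package_id, deep in deep_scan_entries(report["packages"]):
    print(package_id, deep["name"], deep["version"])
```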

What is the structure of FDA’s deep scan info?

This is an example of one dependency object from the report:

[
  ...
  {
        "name": "pytest-cov",
        "version": "1.8.1",
        "spdx": {
            "purl": "pkg:pypi/pytest-cov@1.8.1",
            "supplier_name": "schlamar",
            "supplier_url": "https://github.com/schlamar"
        },
        "fda_info": {
            "bin_url": "",
            "homepage": "https://github.com/pytest-dev/pytest-cov",
            "id": "PyPI::pytest-cov:1.8.1",
            "license": "MIT",
            "source_url": "https://files.pythonhosted.org/packages/11/4b/b04646e97f1721878eb21e9f779102d84dd044d324382263b1770a3e4838/pytest-cov-1.8.1.tar.gz",
            "vcs_type": "git",
            "vcs_revision": "1.8.1",
            "vcs_url": "https://github.com/schlamar/pytest-cov",
            "additional_info": {}
        },
        "fda_license_info": {
            "overall_license": "MIT",
            "confidence": "89.87%",
            "license_data": [
                {
                    "clearly_defined": {
                        "timestamp": "2024-02-05",
                        "rule": "https://api.clearlydefined.io/harvest/pypi/pypi/-/pytest-cov/1.8.1",
                        "license": "BSD-2-Clause"
                    }
                },
                {
                    "clearly_defined": {
                        "timestamp": "2024-02-05",
                        "rule": "https://api.clearlydefined.io/harvest/pypi/pypi/-/pytest-cov/1.8.1",
                        "license": "MIT"
                    }
                },
                {
                    "github_API": {
                        "timestamp": "2024-02-05",
                        "rule": "https://github.com/pytest-dev/pytest-cov/tree/v1.8.1",
                        "license": "MIT"
                    }
                },
                {
                    "github_API": {
                        "timestamp": "2024-02-05",
                        "rule": "https://github.com/pytest-dev/pytest-cov/tree/v1.8.1",
                        "license": "MIT"
                    }
                },
                {
                    "PyPI_API": {
                        "timestamp": "2024-02-05",
                        "rule": "https://pypi.org/project/pytest-cov/",
                        "license": "MIT"
                    }
                },
                {
                    "libraries_io_API": {
                        "timestamp": "2024-02-05",
                        "rule": "https://libraries.io/pypi/pytest-cov/1.8.1",
                        "license": "MIT"
                    }
                },
                {
                    "cli_info": {
                        "timestamp": "2024-02-05",
                        "cli_version": "3.4.14",
                        "license": "MIT",
                        "rule": "https://files.pythonhosted.org/packages/11/4b/b04646e97f1721878eb21e9f779102d84dd044d324382263b1770a3e4838/pytest-cov-1.8.1.tar.gz"
                    }
                },
                {
                    "cli_info": {
                        "timestamp": "2024-02-05",
                        "cli_version": "3.4.14",
                        "license": "MIT",
                        "rule": "https://pypi.python.org/packages/11/4b/b04646e97f1721878eb21e9f779102d84dd044d324382263b1770a3e4838/pytest-cov-1.8.1.tar.gz"
                    }
                },
                {
                    "download_info": {
                        "timestamp": "2024-02-05",
                        "download_url": "https://files.pythonhosted.org/packages/11/4b/b04646e97f1721878eb21e9f779102d84dd044d324382263b1770a3e4838/pytest-cov-1.8.1.tar.gz",
                        "overall_license": "MIT",
                        "rule": "LICENSE.txt,setup.py",
                        "le_version": "1.2.0",
                        "le_underlying_licenses": {
                            "MIT": 3
                        },
                        "le_data": {
                            "1": {
                                "MIT": [
                                    "setup.py",
                                    {
                                        "LICENSE.txt": [
                                            "(c) 2010 Meme Dough"
                                        ]
                                    }
                                ]
                            },
                            "2": {
                                "MIT": [
                                    "pytest_cov.egg-info/PKG-INFO"
                                ]
                            }
                        }
                    }
                }
            ],
            "license_rule": "license_data",
            "vulnerability_data": "",
            "license_category": "PERMISSIVE"
        },
        "compliance_info": {
            "company_copyrights": {},
            "detected_copyleft": {},
            "detected_agpl": {},
            "non_spdx": {},
            "company_emails": {},
            "compat_issues": {
                "file_vs_ov": {
                    "code": -1,
                    "desc": "self"
                },
                "ov_vs_ov": {
                    "code": -1,
                    "desc": "self"
                },
                "prj_vs_ov": {
                    "code": 3,
                    "desc": "n/a"
                },
                "sub_vs_ov": {
                    "code": 3,
                    "desc": "n/a"
                }
            }
        },
        "file_types": {
            "total_files": 15,
            "file_types": {
                "PKG-INFO": 2,
                "py": 3,
                "in": 1,
                "cfg": 1,
                "rst": 1,
                "txt": 6,
                "not-zip-safe": 1
            },
            "dependency_type": "PyPI"
        }
    },
  ...
]    

where:

  • name - Component name
  • version - Component version
  • spdx - SPDX info
  • fda_info - FDA collected info about the component
  • fda_license_info - FDA collected info about licenses:
    • overall_license - Established overall license based on the lists of licenses and sources
    • confidence - Established confidence level based on the list of license sources
    • license_data - List of license identifiers and license sources:
      • clearly_defined - license information gathered from clearlydefined.io
      • github_API - license information gathered from github.com
      • PyPI_API - license information gathered from pypi.org
      • libraries_io_API - license information gathered from libraries.io
      • crates_API - license information gathered from crates.io
      • Maven_API - license information gathered from mvnrepository.com
      • NPM_API - license information gathered from npmjs.com
      • conan_API - license information gathered from conan.io/center
      • Hex_API - license information gathered from hex.pm
      • packagist_API - license information gathered from packagist.org
      • NuGet_API - license information gathered from nuget.org
      • GoMod - license information gathered from pkg.go.dev
      • CocoaPod_API - license information gathered from cocoapods.org
      • Gem_API - license information gathered from rubygems.org
      • Pub_API - license information gathered from pub.dev
      • Haskell_API - license information gathered from hackage.haskell.org
      • cli_info - license information gathered from FossID KB, using fossid-cli
      • download_info - License and copyright info detected in the extracted dependency:
        • timestamp - Detection timestamp.
        • le_version - License extractor version.
        • download_url - Download URL of component.
        • overall_license - Established overall license based on the license information from License Extractor.
        • rule - File path(s) of the license sources from where the overall license was selected.
        • le_underlying_licenses - Dictionary containing the license IDs and their number of occurrences.
        • le_data - License extractor data.
    • license_rule - How the license was picked. This is not available at this moment.
    • vulnerability_data - Detected vulnerability (CPE).
    • license_category - License category that mimics Workbench’s license category matrix.
  • compliance_info - FDA collected compliance info.
    • company_copyrights - Detected company copyrights
    • detected_copyleft - Detected copyleft licenses
    • detected_agpl - Detected AGPL licenses
    • non_spdx - Detected Non-SPDX licenses
    • company_emails - Detected company emails in copyrights
    • compat_issues - Compatibility issues (uses compatibility matrix):
      • file_vs_ov - File licenses vs. component license
      • ov_vs_ov - Component licenses vs. component license (this is used for multi-licenses)
      • prj_vs_ov - Project license vs. component license
      • sub_vs_ov - Transitive dependency vs. component license
  • file_types - Detected file type information
    • dependency_type - Dependency type information
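To get a quick view of how the sources in license_data agree, the entries can be tallied per license. This is an illustrative count only: the way fossid-da derives overall_license and confidence is not described here, and tally_license_data is a hypothetical helper.

```python
from collections import Counter


def tally_license_data(license_data):
    """Count how often each license is reported across all sources."""
    counts = Counter()
    for item in license_data:
        # Each item is a one-key dict: {source_name: {...details...}}.
        for details in item.values():
            # download_info reports "overall_license" instead of "license".
            license_id = details.get("license") or details.get("overall_license")
            if license_id:
                counts[license_id] += 1
    return counts


# Trimmed from the pytest-cov example above.
license_data = [
    {"clearly_defined": {"timestamp": "2024-02-05", "license": "BSD-2-Clause"}},
    {"clearly_defined": {"timestamp": "2024-02-05", "license": "MIT"}},
    {"github_API": {"timestamp": "2024-02-05", "license": "MIT"}},
    {"download_info": {"timestamp": "2024-02-05", "overall_license": "MIT"}},
]
counts = tally_license_data(license_data)
print(counts.most_common())
```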

What is the structure of FDA’s custom report?

The custom report is structured as a list of JSON objects (dictionaries).
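Because the report is a plain list of objects, it can be loaded with the json module and reduced to one row per component. The sketch below uses a trimmed inline sample instead of a real report file. summarize_report is a hypothetical helper, and it assumes confidence is a percentage string as in the example.

```python
import json

# Trimmed sample with the same shape as one entry of the custom report.
REPORT_JSON = """
[
  {
    "name": "pytest-cov",
    "version": "1.8.1",
    "fda_license_info": {"overall_license": "MIT", "confidence": "89.87%"}
  }
]
"""


def summarize_report(entries):
    """Return one summary row per component in the custom report."""
    rows = []
    for entry in entries:
        license_info = entry.get("fda_license_info", {})
        confidence_text = str(license_info.get("confidence", ""))
        try:
            # Assumption: confidence looks like "89.87%".
            confidence = float(confidence_text.rstrip("%"))
        except ValueError:
            confidence = None
        rows.append(
            {
                "component": f'{entry.get("name")}@{entry.get("version")}',
                "overall_license": license_info.get("overall_license"),
                "confidence": confidence,
            }
        )
    return rows


rows = summarize_report(json.loads(REPORT_JSON))
print(rows)
```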

What are the license sources from where fossid-da gets its information?

  1. General query: libraries.io
  2. General query: clearlydefined.io
  3. General query: fossid-cli/fossid KB
  4. CocoaPods: cocoapods.org
  5. CPP: conan.io/center - The Conan libraries and tools central repository
  6. Cargo/Crates: crates.io - Rust Package Registry
  7. Debian: debian.org
  8. Fuget: fuget.org
  9. Nuget: nuget.org - NuGet Gallery
  10. GitHub: https://raw.githubusercontent.com; github.com - GitHub: Let’s build from here
  11. Go: pkg.go.dev - Go Packages
  12. GoogleMaven: maven.google.com - Google’s Maven Repository
  13. GradlePlugins: plugins.gradle.org - Gradle - Plugins
  14. Haskell: hackage.haskell.org - The Haskell Package Repository
  15. Elixir/Hex: hex.pm - The package manager for the Erlang ecosystem
  16. LibrariesIo: libraries.io
  17. Maven: mvnrepository.com - Maven Central Repository
  18. NPM: npmjs.com; https://registry.npmjs.org
  19. PHP/Composer: packagist.org - The PHP Package Repository
  20. Pub: pub.dev - The official repository for Dart and Flutter packages
  21. PyPI: pypi.org - PyPI · The Python Package Index
  22. Ruby: rubygems.org - Ruby community’s gem hosting service

How does this affect the performance of the dependency scan?

Collecting all these types of information from so many different sources increases the duration of a dependency scan.

What other options does deep scan have?

Currently, these are the available deep scan options:

  • da_deep_scan=0 or 1 - Disables (0) or enables (1) deep scan.
  • da_download_path="/tmp/fossid-da" - Download path.
  • da_allow_double_api_calls=0 or 1 - Allows multiple queries to the same source. This is used to get additional information from a potential source. This can affect the number of GitHub queries.
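To review which of these options are currently set, the configuration text can be scanned for the three keys. This sketch assumes fossid.conf holds plain key=value lines and that lines starting with # or ; are comments. read_deep_scan_options is a hypothetical helper, and it returns only the keys it finds rather than guessing defaults.

```python
DEEP_SCAN_KEYS = ("da_deep_scan", "da_download_path", "da_allow_double_api_calls")


def read_deep_scan_options(conf_text: str) -> dict:
    """Return the deep scan options present in conf_text as strings."""
    options = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Assumption: "#" and ";" start comment lines.
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in DEEP_SCAN_KEYS:
            # Drop surrounding whitespace and optional double quotes.
            options[key] = value.strip().strip('"')
    return options


sample_conf = """
# deep scan settings
da_deep_scan=1
da_download_path="/tmp/fossid-da"
unrelated_option=42
"""
options = read_deep_scan_options(sample_conf)
print(options)
```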