Output

SOMEF supports three main output formats. Each of them contains different information with different levels of granularity. Below we enumerate them from more granular to less granular:

JSON format

Version: 1.0.1

Default SOMEF response, and the most complete in terms of metadata. The JSON format returns a set of categories, as shown in the snippet below:

{
  "<categoryName>": [
    {
      ...
    }
  ],
  "<categoryName2>": [
    {
      ...
    }
  ],
  "somef_provenance":{    
    "date": "2022-05-20 12:00:00",
    "somef_version": "0.9.1", 
    "somef_schema_version":"1.0.0"
  }
}

In the snippet, each <categoryName> corresponds to one of the categories SOMEF was able to find. An additional JSON field called somef_provenance returns provenance information of the SOMEF execution. The somef_provenance field always has the same three properties, as shown in the table below:

| Property | Mandatory? | Expected value | Definition |
| --- | --- | --- | --- |
| date | Yes | Date | Date when the extraction was performed. Knowing the date is critical, as a repository may change its README file. |
| somef_version | Yes | String | Version of SOMEF used to extract metadata from a code repository. |
| somef_schema_version | Yes | String | Version of SOMEF schema used to represent the JSON output format. |

Info

If a property is mandatory then it will always be returned in the output JSON.
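
As a minimal sketch of how a consumer might rely on this guarantee, the Python snippet below checks that the three mandatory provenance properties are present in a parsed output. The helper name `check_provenance` and the sample dictionary are illustrative assumptions, not part of SOMEF.

```python
# Hypothetical helper, not part of SOMEF's API.
MANDATORY_PROVENANCE = ("date", "somef_version", "somef_schema_version")


def check_provenance(output: dict) -> list:
    """Return the mandatory provenance properties missing from a parsed SOMEF JSON output."""
    provenance = output.get("somef_provenance", {})
    return [key for key in MANDATORY_PROVENANCE if key not in provenance]


# Sample values copied from the snippet above.
sample = {
    "somef_provenance": {
        "date": "2022-05-20 12:00:00",
        "somef_version": "0.9.1",
        "somef_schema_version": "1.0.0",
    }
}
missing_provenance = check_provenance(sample)  # empty list when all three are present
```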

Category

Each extracted metadata category is returned as a list containing every result SOMEF found for that category when exploring a code repository. For example, the snippet below shows a repository with two descriptions (a short one extracted from the GitHub API and a longer one extracted from the README file). SOMEF aims to return both of them:

"description": [
  {
    "result": {
        "value": "KGTK is a Python library ...",
        "type": "text"
    },
    "confidence": 0.8294290479925978,
    "technique": "Supervised classification",
    "source": "<url to readme file>"
  },
  {
    "result": {
        "value": "Python library for large KG manipulation",
        "type": "text"
    },
    "confidence": 1,
    "technique": "GitHub API"
  }
]

For each element of the list, SOMEF returns a result object, together with its confidence value, the technique used in the extraction and the source file where it's coming from (in case there is one). For example, in the snippet above, SOMEF extracted a description from the README file of the repository using supervised classification, and a short description using the GitHub API.

The confidence depends on the technique used. In this case, the confidence is driven by the classifier which makes the prediction. For the GitHub API the confidence is higher, as it was a description added manually by the authors.
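
Because a category can hold several results with different confidence values, a consumer often needs to choose one. The sketch below picks the entry with the highest confidence; the function name `most_confident` is a hypothetical helper, and the sample list mirrors the description snippet above (with the source placeholder kept as-is).

```python
def most_confident(entries: list):
    """Return the entry with the highest confidence, or None for an empty list."""
    return max(entries, key=lambda entry: entry["confidence"], default=None)


descriptions = [
    {
        "result": {"value": "KGTK is a Python library ...", "type": "text"},
        "confidence": 0.8294290479925978,
        "technique": "Supervised classification",
        "source": "<url to readme file>",
    },
    {
        "result": {"value": "Python library for large KG manipulation", "type": "text"},
        "confidence": 1,
        "technique": "GitHub API",
    },
]

best = most_confident(descriptions)
```

Selecting by confidence alone favors the short GitHub API description here; a consumer that prefers the longer README text would need a different rule.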

SOMEF aims to recognize the following categories (in alphabetical order):

  • acknowledgement: Any text that the authors have prepared to acknowledge the contribution from others, or project funding.
  • application_domain: The application domain of the repository. This may be related to the research area of a software component (e.g., Astrophysics) or the general domain/functionality of the tool (e.g., machine learning projects). See all current recognized application domains here.
  • application_type: Software type: Commandline Application, Notebook Application, Ontology, Scientific Workflow. Non-Software types: Static Website, Uncategorized
  • author: Person(s) or organization(s) responsible for the project. This property is also used to indicate the responsible entities of a publication associated with the code repository.
  • citation: Software citation (usually in .bib or .cff format). SOMEF extracts and structures the metadata from these files (including authors, titles, and DOIs) instead of just returning a raw string.
  • code_of_conduct: Link to the code of conduct file of the project
  • code_repository: Link to the source code (typically the repository where the readme can be found)
  • contact: Contact person responsible for maintaining a software component.
  • continuous_integration: Link to continuous integration service, supported on GitHub as well as in GitLab.
  • contributing_guidelines: Guidelines indicating how to contribute to a software component.
  • contributor: Contributors to this software. Note: Contributor metadata is exported from metadata files (e.g., CodeMeta, CONTRIBUTORS, etc.) not from git logs.
  • copyright_holder: Entity or individual owning the rights to the software. The year is also extracted, if available.
  • date_created: Date when the software component was created.
  • date_updated: Date when the software component was last updated (note that this will always be older than the date of the extraction).
  • description: A description of what the software component does.
  • development_status: The project’s development stage: beta, deprecated...
  • documentation: Where to find additional documentation about a software component.
  • download_url: URL where to download the target software (typically the installer, package or a tarball to a stable version)
  • executable_example: Jupyter notebooks ready for execution (e.g., through myBinder, colab or files)
  • faq: Frequently asked questions about a software component
  • forks_count: Number of forks of the project at the time of the extraction.
  • forks_url: Links to forks made of the project (GitHub only)
  • full_name: Name + owner (owner/name) (if available)
  • full_title: If the repository has a short name, we will attempt to extract the longer version of the repository name. For example, a repository may be called "Widoco", but the longer title is "Wizard for documenting ontologies".
  • funding: Funding code for the related project. Currently, this information is only extracted from existing codemeta.json files within the repository.
  • has_build_file: Build file to create a Docker image for the target software
  • has_package_file: Specifies what package file is present in the code repository.
  • has_script_file: Snippets of code contained in the repository.
  • homepage: URL of the item.
  • identifier: Identifiers detected within a repository (e.g., Digital Object Identifier).
  • images: Images used to illustrate the software component.
  • installation: A set of instructions that indicate how to install a target repository
  • invocation: Execution command(s) needed to run a scientific software component
  • issue_tracker: Link where to open issues for the target repository
  • keywords: set of terms used to commonly identify a software component
  • license: License and usage terms of a software component
  • logo: Main logo used to represent the target software component.
  • maintainer: Individuals or teams responsible for maintaining the software component, extracted from the CODEOWNERS file
  • name: Name identifying a software component
  • ontologies: URL and path to the ontology files present in the repository.
  • owner: Name of the user or organization in charge of the repository
  • package_distribution: Link to official package repositories where the software can be downloaded from (e.g., pypi).
  • package_file: Link to a package file used in the repository (e.g., pyproject.toml, setup.py).
  • package_id: Identifier extracted from package files (e.g., package.json).
  • programming_languages: Languages used in the repository.
  • readme_url: URL to the main README file in the repository.
  • related_papers: URLs of possibly related papers mentioned in the README file.
  • releases: Pointer to the available versions of a software component.
  • repository_status: Repository status as it is described in repostatus.org.
  • requirements: Pre-requisites and dependencies needed to execute a software component.
  • run: Running instructions of a software component. It may be wider than the invocation category, as it may include several steps and explanations.
  • runtime_platform: Specifies the runtime environment or script interpreter dependencies required to run the project (e.g., Python, Java, Julia).
  • stargazers_count: Total number of stargazers of the project.
  • support: Guidelines and links of where to obtain support for a software component.
  • support_channels: Help channels one can use to get support about the target software component.
  • usage: Usage examples and considerations of a code repository.
  • workflows: URL and path to the computational workflow files present in the repository.

The following table summarizes the properties used to describe a category:

| Property | Mandatory? | Expected value | Definition |
| --- | --- | --- | --- |
| confidence | Yes | Number | Value ranging from 0 (very low) to 1 (very high) indicating the confidence of the program in the quality of the extraction. |
| result | Yes | Result | Result obtained from the extraction in a code repository |
| source | No | Url | URL of the source file used for the extraction. |
| technique | Yes | String | Technique used for the extraction. One of the following: Supervised classification, header analysis, regular expression, GitHub API, File exploration, Code parsing |
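
The table can be read as a small contract for each entry in a category list. The following sketch checks an entry against it; `entry_problems` is a hypothetical helper written for illustration and does not reflect how SOMEF validates its own output.

```python
MANDATORY_ENTRY_KEYS = ("confidence", "result", "technique")  # "source" is optional


def entry_problems(entry: dict) -> list:
    """List the ways an entry deviates from the property table above."""
    problems = [f"missing '{key}'" for key in MANDATORY_ENTRY_KEYS if key not in entry]
    confidence = entry.get("confidence")
    # Confidence is documented as a number between 0 and 1.
    if isinstance(confidence, (int, float)) and not 0 <= confidence <= 1:
        problems.append("confidence outside [0, 1]")
    return problems


valid_entry = {
    "result": {"value": "Python library for large KG manipulation", "type": "text"},
    "confidence": 1,
    "technique": "GitHub API",
}
no_problems = entry_problems(valid_entry)
some_problems = entry_problems({"confidence": 1.5})
```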

Result

Field returning the extracted output from the code repository. An example can be seen below for a citation found in BibTeX format in a README file of a code repository:

"citation": [
  {
    "result": {
      "value": "@inproceedings{ilievski2020kgtk,\n  title={{KGTK}: A Toolkit for Large Knowledge Graph Manipulation and Analysis}},\n  author={Ilievski, Filip and Garijo, Daniel and Chalupsky, Hans and Divvala, Naren Teja and Yao, Yixiang and Rogers, Craig and Li, Ronpeng and Liu, Jun and Singh, Amandeep and Schwabe, Daniel and Szekely, Pedro},\n  booktitle={International Semantic Web Conference},\n  pages={278--293},\n  year={2020},\n  organization={Springer}\n  url={https://arxiv.org/pdf/2006.00088.pdf}\n}",
      "format": "bibtex",
      "type": "string",
      "title": "{KGTK}: A Toolkit for Large Knowledge Graph Manipulation and Analysis",
      "url": "https://arxiv.org/pdf/2006.00088.pdf",
      "original_header": "citation"
    },
    "confidence": 1.0,
    "technique": "Regular expression",
    "source": "<url to README file>"
  }
]

A result may have the following fields:

| Property | Mandatory? | Expected value | Definition |
| --- | --- | --- | --- |
| format | No | String | Format in which the value is returned. For example, it may be a Dockerfile, a Jupyter notebook, or a citation in BibTeX. |
| type | Yes | String | Text indicating the value type of the result. In some cases it refers to the type of literal being returned, while in others it refers to the type of the object. For example, a license may be detected as a URL, as a text excerpt detected from a file, or as an object with both name and url. |
| value | Yes | String, Number, Date or Url | Text with the result of the extraction performed by SOMEF. The value is always a single object, not a list. |

Depending on the type of the result, additional properties may be found.

The following object types are currently supported:

  • Release: software releases of the current code repository, as available from GitHub.
  • Programming_language: Programming language used in the repository.
  • License: object representing all the metadata SOMEF extracts from a license.
  • Agent: user (typically, a person) or organization responsible for authoring a software release or a paper.
  • ScholarlyArticle: Scientific paper or article associated with the code repository.
  • SoftwareApplication: Class to represent the main software component metadata.
  • SoftwareDependency: Class to represent software dependencies and runtime platforms required to run the project.

The following literal types are currently supported:

  • Number: A numerical value. We do not distinguish between integer, long or float.
  • Date: Dates in xsd:date format.
  • String: Any representation in text that is not considered a number, date or url. There are two special types of strings:
      • Text_excerpt: The value is a string that has been extracted from a file.
      • File_dump: The value is a string with the contents of a file (e.g., a citation.cff file, or a license.md file).
  • Url: uniform resource locator of a file.
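
Since the additional properties of a result depend on its type, a consumer may want to tell object types from literal types before reading it. The sketch below does this with the type names from the two lists above. Comparing names case-insensitively is an assumption made here because the snippets in this page use lowercase values such as "string"; values not in either list are reported as unknown. The helper `kind_of` is hypothetical.

```python
# Type names copied from the lists above, lowercased for comparison (an assumption).
OBJECT_TYPES = {
    "release", "programming_language", "license", "agent",
    "scholarlyarticle", "softwareapplication", "softwaredependency",
}
LITERAL_TYPES = {"number", "date", "string", "text_excerpt", "file_dump", "url"}


def kind_of(result: dict) -> str:
    """Classify a result as 'object', 'literal' or 'unknown' from its type field."""
    type_name = str(result.get("type", "")).lower()
    if type_name in OBJECT_TYPES:
        return "object"
    if type_name in LITERAL_TYPES:
        return "literal"
    return "unknown"


license_kind = kind_of({"type": "License", "name": "MIT License"})
url_kind = kind_of({"type": "Url", "value": "https://github.com/KnowledgeCaptureAndDiscovery/somef"})
```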

The tables below summarize all types and their corresponding properties. Object types are aligned with the Schema.org and CodeMeta vocabularies.

An Agent has the following properties:

| Property | Expected value | Definition |
| --- | --- | --- |
| affiliation | String | Name of organization or affiliation |
| email | String | Email of an author |
| family_name | String | Last name of an author |
| given_name | String | First name of an author |
| identifier | String | Id of an agent |
| name | String | Name used to designate the person or organization |
| role | String | The role of the agent in the development or maintenance of this software component |
| url | Url | Uniform resource locator of the resource |

An Asset has the following properties:

| Property | Expected value | Definition |
| --- | --- | --- |
| content_size | Integer | Size of file |
| content_url | String | Direct download link for the release file |
| download_count | Integer | Number of downloads |
| encoding_format | String | Format of the file |
| name | String | Title or name of the file |
| upload_date | Date | Date of creation of a release |
| url | Url | Uniform resource locator of the resource |

A License has the following properties:

| Property | Expected value | Definition |
| --- | --- | --- |
| identifier | String | Id of the license |
| name | String | Title or name of the license |
| spdx_id | String | SPDX id corresponding to this license |
| url | Url | Uniform resource locator of the license |

A Programming_language has the following properties:

| Property | Expected value | Definition |
| --- | --- | --- |
| name | String | Name of the language |
| size | Integer | File size content (bytes) of a code repository using a given programming language |

A Release has the following properties:

| Property | Expected value | Definition |
| --- | --- | --- |
| assets | Asset | Files attached to the release |
| author | Agent, Organization | Person or organization responsible for creating an article or a software release. |
| description | String | Descriptive text with the purpose of the release |
| date_created | Date | Date of creation of a release |
| date_published | Date | Date of publication of a release |
| html_url | Url | Link to the HTML representation of a release |
| name | String | Title or name used to designate the release, license user or programming language. |
| release_id | String | Id of a software release. |
| tag | String | Named version of a release |
| tarball_url | Url | URL to the tarball file where to download a software release |
| url | Url | Uniform resource locator of the resource |
| zipball_url | Url | URL to the zip file where to download a software release |

A ScholarlyArticle has the following properties:

| Property | Expected value | Definition |
| --- | --- | --- |
| author | List of Agent | List of authors responsible for the publication, providing structured metadata for each |
| date_published | String | Date when the article or citation was officially published. |
| doi | String | Digital Object Identifier (DOI) of the reference, usually returned as a full URL. |
| journal | String | Journal where the publication appeared |
| pages | String | Page range of the publication |
| title | String | Title of reference or citation |
| url | String | Link to reference or citation |
| value | String | Title of reference or citation |
| year | Number | Year of publication |

A SoftwareApplication or SoftwareDependency has the following properties:

| Property | Expected value | Definition |
| --- | --- | --- |
| dependency_type | String | Indicates the scope of the dependency: development, runtime or documentation. |
| dependency_resolver | String | Identifies the ecosystem or package manager that resolves the dependency (e.g., npm, pip, julia, conda). |
| is_preferred_citation | Boolean | Set to True if the authors explicitly state this is the preferred citation. Omitted otherwise. |
| name | String | Name of the software, dependency, or runtime platform (e.g., "pandas", "python"). |
| type | String | The object type: SoftwareApplication (for the main repository) or SoftwareDependency (for requirements and platforms). |
| value | String | A string representation typically combining name and version. |
| version | String | The version or version range of the software/dependency. |

A Text_excerpt has the following properties:

| Property | Expected value | Definition |
| --- | --- | --- |
| original_header | String | If the result value is extracted from a markdown file like a README, the original header of that section is also returned. |
| parent_header | [String] | If the result value is extracted from a markdown file like a README, the parent header(s) of the current section are also returned (in case they exist). |
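
The two header properties can be combined to show where in a README an excerpt was found. The sketch below joins them into a readable path. It assumes that parent_header lists the outermost header first, which this page does not state; `header_path` and the sample header names are hypothetical.

```python
def header_path(result: dict) -> str:
    """Join parent headers and the original header into a single readable path."""
    # Assumption: parent_header is ordered from the outermost header inward.
    parts = list(result.get("parent_header", []))
    header = result.get("original_header")
    if header:
        parts.append(header)
    return " > ".join(parts)


nested = header_path({
    "value": "pip install kgtk",
    "type": "Text_excerpt",
    "original_header": "Installation",
    "parent_header": ["KGTK", "Getting started"],
})
top_level = header_path({"original_header": "citation"})
```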

Format

The following formats for a result value are currently recognized:

  • bibtex: format typically used to document bibliography in LaTeX projects.
  • cff: Citation file format, an increasingly popular format for citing software projects.
  • jupyter_notebook: computational notebooks typically used in data science.
  • dockerfile: Docker files used to build Docker images.
  • docker_compose: orchestration file used to coordinate multiple containers.
  • readthedocs: documentation format used by many repositories in order to describe their projects.
  • wiki: documentation format used in GitHub repositories.
  • setup.py: package file format used in python projects.
  • publiccode.yml: metadata file used to describe public sector software projects.
  • pyproject.toml: package file format used in python projects.
  • pom.xml: package file used in Java projects.
  • package.json: package file used in Javascript projects.
  • bower.json: package descriptor used for configuring packages that can be used as a dependency for Bower-managed front-end projects.
  • composer.json: manifest file that serves as the package descriptor in PHP projects.
  • cargo.toml.json: manifest file that serves as the package descriptor in Rust projects.
  • [name].gemspec: manifest file that serves as the package descriptor in Ruby gem projects.

Technique

The techniques can be of several types:

  • code_parser: the result was obtained from parsing package files with metadata.
  • header_analysis: the result was extracted by analyzing the headers used in the README file and assessing their proximity to commonly used headers (and other synonyms).
  • file_exploration: the result comes from an exploration of the files in the repository
  • GitHub_API: the result was obtained from the GitHub API.
  • GitLab_API: the result was obtained from the GitLab API.
  • regular_expression: the result was obtained after performing regular expressions on the files in the repository.
  • software_type_heuristics: the result was obtained from analysis of the repository based on various heuristics from the README, code and extension analysis.
  • supervised_classification: the results were obtained after running text classifiers trained for detecting that type of header.

Missing categories

If SOMEF is run with the -m flag, a report of the categories that the program was not able to find is returned. The format for this field is slightly different than the rest, providing a list of the missing categories. An example can be seen below:

"somef_missing_categories": [
  "description", 
  "citation"
]

In this case, SOMEF was not able to find a description or a citation in the target repository. Missing categories will not be added to the Codemeta and Turtle exports (under construction). Note that the prefix somef is added to the field name to indicate that this is a special type of category.
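
When reading such an output, it can help to separate the categories that were found from those reported as missing. The sketch below treats every key starting with the somef prefix as a special field rather than an extracted category. The helper `split_categories` and the sample output are illustrative assumptions.

```python
def split_categories(output: dict):
    """Return (found, missing) category names from a parsed SOMEF JSON output."""
    # Keys with the somef prefix are special fields, not extracted categories.
    found = sorted(key for key in output if not key.startswith("somef_"))
    missing = list(output.get("somef_missing_categories", []))
    return found, missing


sample_output = {
    "name": [
        {"result": {"value": "kgtk", "type": "String"}, "confidence": 1, "technique": "GitHub API"}
    ],
    "somef_missing_categories": ["description", "citation"],
    "somef_provenance": {
        "date": "2022-05-20 12:00:00",
        "somef_version": "0.9.1",
        "somef_schema_version": "1.0.0",
    },
}
found, missing = split_categories(sample_output)
```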

Citation Reconciliation

SOMEF reconciles extracted citations by matching unique identifiers (such as DOIs or URLs) to ensure a single, non-duplicated list of references. During this process, each entry retains its provenance, including the original source and the extraction technique used, ensuring full metadata transparency. Once reconciled, the Codemeta export module distinguishes between scholarly articles and the software itself, mapping them respectively to referencePublication and creditText in the final output.

For a detailed breakdown of how these categories are structured and exported, please refer to the Codemeta format section.
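
The reconciliation idea described above can be sketched as a merge keyed on an identifier. The code below is only an illustration under stated assumptions: it uses the doi property, then url, then the raw value as the matching key, and it stores provenance in a list named `provenance`. None of these choices is taken from SOMEF's implementation, and `reconcile` is a hypothetical function.

```python
def reconcile(citations: list) -> list:
    """Merge citation entries that share an identifier, keeping each entry's provenance."""
    merged = {}
    for entry in citations:
        result = entry["result"]
        # Assumed matching key: DOI first, then URL, then the raw value.
        key = str(result.get("doi") or result.get("url") or result["value"]).strip().lower()
        if key not in merged:
            merged[key] = {"result": dict(result), "provenance": []}
        merged[key]["provenance"].append(
            {"source": entry.get("source"), "technique": entry["technique"]}
        )
    return list(merged.values())


citations = [
    {
        "result": {"value": "@inproceedings{ilievski2020kgtk, ...}", "format": "bibtex",
                   "url": "https://arxiv.org/pdf/2006.00088.pdf"},
        "confidence": 1.0,
        "technique": "Regular expression",
        "source": "README.md",
    },
    {
        "result": {"value": "KGTK: A Toolkit for Large Knowledge Graph Manipulation and Analysis",
                   "format": "cff", "url": "https://arxiv.org/pdf/2006.00088.pdf"},
        "confidence": 1.0,
        "technique": "File exploration",
        "source": "CITATION.cff",
    },
]
reconciled = reconcile(citations)
```

Both sample entries share the same URL, so they collapse into one reference that records two provenance records.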

Codemeta format

JSON-LD representation following the Codemeta specification (which itself extends Schema.org). The result, provenance and confidence fields are omitted in this representation (every category with confidence above the threshold specified when running SOMEF will be included in the results). In addition, any metadata category that is not defined in Codemeta is left out.

In the CodeMeta export, these categories are mapped to their respective fields, ensuring compliance with scientific metadata standards.

The table below summarizes the mapping between the SOMEF internal JSON structure and the Codemeta/Schema.org output:

| Codemeta / Schema.org Field | SOMEF Category | Description |
| --- | --- | --- |
| author | author | Principal authors |
| buildInstructions | installation / documentation | Installation or build instructions |
| creditText | citation (Software) | Human-readable citation for the software (see note 1 below) |
| codeRepository | code_repository | Source code repository URL |
| contributor | contributor | Project contributors |
| copyrightHolder | copyright_holder | Entity holding the copyright |
| copyrightYear | copyright_holder | Year of copyright |
| dateCreated | date_created | Creation date |
| dateModified | date_modified | Last modification date |
| datePublished | date_published | Initial publication date |
| description | description | Project description |
| developmentStatus | repository_status | Current development status |
| downloadUrl | download_url | Link to download the software |
| funder | funding | Organization funding the project |
| funding | funding | Funding details/grants |
| identifier | identifier | Unique identifiers (e.g., DOI, SWHID) |
| issueTracker | code_repository / issue_tracker | Issue tracker URL (derived from codeRepository) or issue_tracker |
| keywords | keywords | Project tags or keywords |
| license | license | Software license |
| logo | logo | Project logo URL |
| maintainer | maintainer | Project maintainers |
| name | name | Software name |
| programmingLanguage | programming_languages | Languages used |
| readme | readme_url | README file URL |
| referencePublication | citation (Papers) | |
| releaseNotes | releases | Release notes extracted from each release description |
| runtimePlatform | runtime_platform | |
| softwareRequirements | requirements | Technical dependencies and requirements extracted from config files or text |
| softwareVersion | releases | Current version extracted from version of releases |
| url | homepage | Project homepage URL |
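
To make the mapping concrete, the sketch below applies a few of the one-to-one rows of the table to a parsed SOMEF JSON output, dropping entries below a confidence threshold. It is a simplified illustration, not SOMEF's export code: the default threshold of 0.8, the inclusive comparison, and the function name `to_codemeta_fields` are assumptions, and rows that derive several fields from one category (such as funding or releases) are left out.

```python
# Subset of the mapping table above (SOMEF category -> Codemeta field).
CATEGORY_TO_CODEMETA = {
    "code_repository": "codeRepository",
    "description": "description",
    "download_url": "downloadUrl",
    "homepage": "url",
    "keywords": "keywords",
    "name": "name",
    "readme_url": "readme",
}


def to_codemeta_fields(output: dict, threshold: float = 0.8) -> dict:
    """Keep result values whose confidence reaches the threshold, keyed by Codemeta field."""
    exported = {}
    for category, field in CATEGORY_TO_CODEMETA.items():
        values = [
            entry["result"]["value"]
            for entry in output.get(category, [])
            if entry["confidence"] >= threshold
        ]
        if values:
            exported[field] = values
    return exported


somef_output = {
    "description": [
        {"result": {"value": "KGTK is a Python library ...", "type": "text"},
         "confidence": 0.83, "technique": "Supervised classification"},
        {"result": {"value": "A low-confidence guess", "type": "text"},
         "confidence": 0.5, "technique": "Supervised classification"},
    ],
    "homepage": [
        {"result": {"value": "https://example.org", "type": "Url"},
         "confidence": 1, "technique": "GitHub API"},
    ],
}
codemeta_fields = to_codemeta_fields(somef_output)
```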

1 Handling Citations: creditText vs. referencePublication

SOMEF reconciles extracted citations by distinguishing between the software itself and its associated scholarly articles. To ensure compliance with the Codemeta specification, the citation category from the internal JSON is processed and mapped to two distinct fields in the output:

  • creditText: This field is intended for the software itself, providing a ready-to-use citation string for users. It includes authors, the software title, and a direct link to the repository (added during the export process). SOMEF populates this when it detects software-related citations (e.g., from CITATION.cff or root metadata).
  • referencePublication: This field is intended for scholarly articles (including preferred-citations and general papers). These are mapped as ScholarlyArticle objects containing structured metadata such as DOI, journal, and title. SOMEF populates this field when it identifies citations based on specific indicators like DOI, journal, or explicit ScholarlyArticle classification.

This transformation ensures that the final Codemeta JSON-LD remains clean, compliant, and clearly separates the software project from its associated academic literature.

Example:

```
"creditText": [
  "Garijo, D., Mao, A., Dharmala, H., Diwanji, C., Wang, J., et al. (somef: software metadata extraction framework). Available at: https://github.com/KnowledgeCaptureAndDiscovery/somef"
]
```
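
The routing between the two fields can be sketched as a simple rule based on the indicators named above (DOI, journal, or an explicit ScholarlyArticle type). The function below is a hypothetical illustration of that rule, not SOMEF's actual export logic, and the sample results are made up for the example.

```python
def codemeta_field_for(result: dict) -> str:
    """Choose the Codemeta field for a citation result using the indicators described above."""
    looks_like_paper = (
        result.get("type") == "ScholarlyArticle"
        or bool(result.get("doi"))
        or bool(result.get("journal"))
    )
    return "referencePublication" if looks_like_paper else "creditText"


paper_field = codemeta_field_for({
    "type": "ScholarlyArticle",
    "title": "{KGTK}: A Toolkit for Large Knowledge Graph Manipulation and Analysis",
})
software_field = codemeta_field_for({
    "type": "String",
    "value": "somef: software metadata extraction framework",
})
```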