Output

SOMEF supports three main output formats. Each of them contains different information with different levels of granularity. Below we enumerate them from more granular to less granular:

JSON format

Version: 1.0.0

This is the default SOMEF output format and the most complete one in terms of metadata. The JSON format returns a set of categories, as shown in the snippet below:

{
  "<categoryName>": [
    {
      ...
    }
  ],
  "<categoryName2>": [
    {
      ...
    }
  ],
  "somef_provenance":{    
    "date": "2022-05-20 12:00:00",
    "somef_version": "0.9.1", 
    "somef_schema_version":"1.0.0"
  }
}

In the snippet, each <categoryName> corresponds to one of the categories SOMEF was able to find. An additional JSON field called somef_provenance returns provenance information about the SOMEF execution. The somef_provenance field always has the same three properties, as shown in the table below:

| Property | Mandatory? | Expected value | Definition |
| --- | --- | --- | --- |
| date | Yes | Date | Date when the extraction was performed. Knowing the date is critical, as a repository may change its README file. |
| somef_version | Yes | String | Version of SOMEF used to extract metadata from a code repository. |
| somef_schema_version | Yes | String | Version of SOMEF schema used to represent the JSON output format. |

Info

If a property is mandatory then it will always be returned in the output JSON.
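
As a rough illustration of how a consumer might rely on these mandatory properties, the sketch below loads a SOMEF-style JSON document and reports which provenance properties are absent. The sample document and the helper name `missing_provenance_fields` are assumptions made for this example; they are not part of SOMEF.

```python
import json

# Hypothetical SOMEF output, trimmed to the provenance field shown above.
raw = """
{
  "somef_provenance": {
    "date": "2022-05-20 12:00:00",
    "somef_version": "0.9.1",
    "somef_schema_version": "1.0.0"
  }
}
"""

# Properties the provenance table lists as mandatory.
MANDATORY_PROVENANCE = ("date", "somef_version", "somef_schema_version")


def missing_provenance_fields(output: dict) -> list:
    """Return the mandatory provenance properties absent from an output."""
    provenance = output.get("somef_provenance", {})
    return [key for key in MANDATORY_PROVENANCE if key not in provenance]


output = json.loads(raw)
missing = missing_provenance_fields(output)
print(missing)
```

An empty list means every mandatory provenance property was present.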

Category

Each extracted metadata category is returned as a list with one entry per result SOMEF found for that category when exploring a code repository. For example, the snippet below shows a repository with two descriptions (a short one extracted from the GitHub API and a longer one extracted from the README file). SOMEF aims to return both of them:

"description": [
  {
    "result": {
        "value": "KGTK is a Python library ...",
        "type": "text"
    },
    "confidence": 0.8294290479925978,
    "technique": "Supervised classification",
    "source": "<url to readme file>"
  },
  {
    "result": {
        "value": "Python library for large KG manipulation",
        "type": "text"
    },
    "confidence": 1,
    "technique": "GitHub API"
  }
]

For each element of the list, SOMEF returns a result object, together with its confidence value, the technique used in the extraction and the source file where it's coming from (in case there is one). For example, in the snippet above, SOMEF extracted a description from the README file of the repository using supervised classification, and a short description using the GitHub API.

The confidence depends on the technique used. For the README description above, the confidence is the score given by the classifier that made the prediction. For the GitHub API the confidence is higher, as that description was added manually by the authors.
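
When a single value per category is needed, one possible approach is to keep the entry with the highest confidence. The sketch below applies that idea to the description snippet above; the helper `best_result` is hypothetical and is not something SOMEF provides.

```python
# Sample data copied from the description snippet above.
descriptions = [
    {
        "result": {"value": "KGTK is a Python library ...", "type": "text"},
        "confidence": 0.8294290479925978,
        "technique": "Supervised classification",
        "source": "<url to readme file>",
    },
    {
        "result": {"value": "Python library for large KG manipulation", "type": "text"},
        "confidence": 1,
        "technique": "GitHub API",
    },
]


def best_result(entries):
    """Return the entry with the highest confidence, or None if there is none."""
    return max(entries, key=lambda entry: entry["confidence"], default=None)


best = best_result(descriptions)
print(best["result"]["value"], "via", best["technique"])
```

Here the GitHub API description is selected, since its confidence of 1 is higher than the classifier score.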

SOMEF aims to recognize the following categories (in alphabetical order):

  • application_domain: The application domain of the repository. This may be related to the research area of a software component (e.g., Astrophysics) or the general domain/functionality of the tool (e.g., machine learning projects). See all current recognized application domains here.
  • acknowledgement: Any text that the authors have prepared to acknowledge the contribution from others, or project funding.
  • contributors: Contributors to a software component
  • contributing_guidelines: Guidelines indicating how to contribute to a software component.
  • citation: Software citation (usually in .bib form) as the authors have stated in their readme file, or through a CFF file.
  • code_of_conduct: Link to the code of conduct file of the project
  • code_repository: Link to the source code (typically the repository where the readme can be found)
  • contact: Contact person responsible for maintaining a software component.
  • date_created: Date when the software component was created.
  • date_updated: Date when the software component was last updated (note that this will always be earlier than the date of the extraction).
  • description: A description of what the software component does.
  • documentation: Where to find additional documentation about a software component.
  • download: Download instructions included in the repository.
  • download_url: URL where to download the target software (typically the installer, package or a tarball to a stable version)
  • executable_example: Jupyter notebooks ready for execution (e.g., through myBinder, colab or files)
  • faq: Frequently asked questions about a software component
  • forks_count: Number of forks of the project at the time of the extraction.
  • forks_url: Links to forks made of the project (GitHub only)
  • full_name: Name + owner (owner/name) (if available)
  • full_title: If the repository has a short name, we will attempt to extract the longer version of the repository name. For example, a repository may be called "Widoco", but the longer title is "Wizard for documenting ontologies".
  • has_build_file: Build file to create a Docker image for the target software
  • has_script_file: Snippets of code contained in the repository.
  • identifier: Identifiers detected within a repository (e.g., Digital Object Identifier).
  • images: Images used to illustrate the software component.
  • installation: A set of instructions that indicate how to install a target repository
  • invocation: Execution command(s) needed to run a scientific software component
  • issue_tracker: Link where to open issues for the target repository
  • keywords: Set of terms commonly used to identify a software component
  • license: License and usage terms of a software component
  • logo: Main logo used to represent the target software component.
  • name: Name identifying a software component
  • ontologies: URL and path to the ontology files present in the repository.
  • owner: Name of the user or organization in charge of the repository
  • package_distribution: Link to official package repositories where the software can be downloaded from (e.g., pypi).
  • programming_languages: Languages used in the repository.
  • readme_url: URL to the main README file in the repository.
  • related_documentation: Pointers to documentation of related projects which may be needed when using the target repository.
  • related_papers: URL to possible related papers within the repository stated within the readme file.
  • releases: Pointer to the available versions of a software component.
  • repository_status: Repository status as it is described in repostatus.org.
  • requirements: Pre-requisites and dependencies needed to execute a software component.
  • run: Running instructions of a software component. It may be wider than the invocation category, as it may include several steps and explanations.
  • stargazers_count: Total number of stargazers of the project.
  • support: Guidelines and links of where to obtain support for a software component.
  • support_channels: Help channels one can use to get support about the target software component.
  • usage: Usage examples and considerations of a code repository.
  • workflows: URL and path to the workflow files present in the repository.
  • type: Software type: Commandline Application, Notebook Application, Ontology, Workflow. Non-Software types: Static Website, Uncategorized

The following table summarizes the properties used to describe a category:

| Property | Mandatory? | Expected value | Definition |
| --- | --- | --- | --- |
| confidence | Yes | Number | Value ranging from 0 (very low) to 1 (very high) indicating the confidence of the program in the quality of the extraction. |
| result | Yes | Result | Result obtained from the extraction in a code repository |
| source | No | Url | URL of the source file used for the extraction. |
| technique | Yes | String | Technique used for the extraction. One of the following list: Supervised classification, header analysis, regular expression, GitHub API, File exploration, Code parsing |
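
The table can be read as a small contract for each entry in a category list. The following sketch checks one entry against it: the three mandatory properties must be present and the confidence must lie between 0 and 1. The function `check_entry` and its messages are illustrative assumptions, not part of SOMEF.

```python
# Properties marked as mandatory in the table; "source" is optional.
MANDATORY = {"confidence", "result", "technique"}


def check_entry(entry: dict) -> list:
    """Return a list of problems found in one category entry (empty if none)."""
    problems = [
        f"missing property: {name}" for name in sorted(MANDATORY - set(entry))
    ]
    confidence = entry.get("confidence")
    if isinstance(confidence, (int, float)) and not 0 <= confidence <= 1:
        problems.append(f"confidence out of range: {confidence}")
    return problems


valid = {
    "result": {"value": "Python library for large KG manipulation", "type": "text"},
    "confidence": 1,
    "technique": "GitHub API",
}
print(check_entry(valid))
print(check_entry({"confidence": 1.5}))
```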

Result

Field returning the extracted output from the code repository. An example can be seen below for a citation found in BibTeX format in a README file of a code repository:

"citation": [
  {
    "result": {
      "value": "@inproceedings{ilievski2020kgtk,\n  title={{KGTK}: A Toolkit for Large Knowledge Graph Manipulation and Analysis}},\n  author={Ilievski, Filip and Garijo, Daniel and Chalupsky, Hans and Divvala, Naren Teja and Yao, Yixiang and Rogers, Craig and Li, Ronpeng and Liu, Jun and Singh, Amandeep and Schwabe, Daniel and Szekely, Pedro},\n  booktitle={International Semantic Web Conference},\n  pages={278--293},\n  year={2020},\n  organization={Springer}\n  url={https://arxiv.org/pdf/2006.00088.pdf}\n}",
      "format": "bibtex",
      "type": "string",
      "url": "https://arxiv.org/pdf/2006.00088.pdf"
    },
    "confidence": 1.0,
    "technique": "Regular expression",
    "source": "<url to README file>"
  }
]

A result may have the following fields:

| Property | Mandatory? | Expected value | Definition |
| --- | --- | --- | --- |
| format | No | String | Format in which the value is returned. For example, it may be a Dockerfile, a Jupyter notebook, or a citation in BibTeX. |
| type | Yes | String | Text indicating the value type of the result. In some cases it refers to the type of literal being returned, while in others it refers to the type of the object. For example, a license may be detected as a URL, as a text excerpt detected from a file, or as an object with both name and url. |
| value | Yes | String, Number, Date or Url | Text with the result of the extraction performed by SOMEF. The value is always a single object, not a list. |

Depending on the type of the result, additional properties may be found.

The following object types are currently supported:

  • Release: software releases of the current code repository, as available from GitHub.
  • Programming_language: Programming language used in the repository.
  • License: object representing all the metadata SOMEF extracts from a license.
  • Agent: user (typically, a person) or organization responsible for authoring a software release or a paper.
  • Publication: Scientific paper associated with the code repository.

The following literal types are currently supported:

  • Number: A numerical value. We do not distinguish between integer, long or float.
  • Date: Dates in xsd:date format.
  • String: Any representation in text that is not considered a number, date or url. There are two special types of strings:
      ◦ Text_excerpt: The value is a string that has been extracted from a file.
      ◦ File_dump: The value is a string with the contents of a file (e.g., a citation.cff file, or a license.md file).
  • Url: Uniform resource locator of a file.

The table below summarizes all types and their corresponding properties:

| Property | Describes | Expected value | Definition |
| --- | --- | --- | --- |
| author | Release, Publication | Agent, Organization | Person or organization responsible for creating an article or a software release. |
| doi | Publication | Url | When a publication is detected, but the format is in BibTeX or CFF, SOMEF will add a doi field with the detected DOI value. The result includes a full URL. |
| description | Release | String | Descriptive text with the purpose of the release |
| date_created | Release | Date | Date of creation of a release |
| date_published | Release | Date | Date of publication of a release |
| html_url | Release | Url | Link to the HTML representation of a release |
| name | License, Release, User, Programming_language | String | Title or name used to designate the release, license, user or programming language. |
| original_header | Text_excerpt | String | If the result value is extracted from a markdown file like a README, the original header of that section is also returned. |
| parent_header | Text_excerpt | [String] | If the result value is extracted from a markdown file like a README, the parent header(s) of the current section are also returned (in case they exist). |
| release_id | Release | String | Id of a software release. |
| size | Programming_language | Number | File size content (bytes) of a code repository using a given programming language |
| spdx_id | License | String | SPDX identifier corresponding to this license |
| tag | Release | String | Named version of a release |
| tarball_url | Release | Url | URL to the tarball file where to download a software release |
| title | Publication | String | Title of the publication |
| url | Release, Publication, License | Url | Uniform resource locator of the resource |
| zipball_url | Release | Url | URL to the zip file where to download a software release |
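
To illustrate how type-specific properties might be consumed, the sketch below rebuilds the header path of a Text_excerpt result from its parent_header and original_header properties. The sample entry (including its value and headers) and the helper `header_path` are invented for this example.

```python
# Hypothetical entry: the value and headers are made up for illustration.
entry = {
    "result": {
        "value": "pip install <package>",
        "type": "Text_excerpt",
        "original_header": "Installation",
        "parent_header": ["Getting started"],
    },
    "confidence": 1,
    "technique": "header analysis",
    "source": "<url to README file>",
}


def header_path(result: dict) -> str:
    """Join the parent headers and section header of a Text_excerpt result."""
    if result.get("type") != "Text_excerpt":
        return ""  # other types do not carry header properties
    headers = list(result.get("parent_header", []))
    if "original_header" in result:
        headers.append(result["original_header"])
    return " > ".join(headers)


path = header_path(entry["result"])
print(path)
```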

Format

The following formats for a result value are currently recognized:

  • bibtex: format typically used to document bibliography in LaTeX projects.
  • cff: Citation file format, an increasingly popular format for citing software projects.
  • jupyter_notebook: computational notebooks typically used in data science.
  • dockerfile: Docker files used to build Docker images.
  • docker_compose: orchestration file used to connect multiple containers.
  • readthedocs: documentation format used by many repositories in order to describe their projects.
  • wiki: documentation format used in GitHub repositories.
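
Since format is an optional property of a result, a consumer can use it to pick out values of one kind across all categories. The sketch below collects every result declared as bibtex; the sample output and the helper `results_with_format` are assumptions for illustration only.

```python
# Hypothetical output with two categories and the provenance field.
output = {
    "citation": [
        {
            "result": {
                "value": "@inproceedings{ilievski2020kgtk, ...}",
                "format": "bibtex",
                "type": "string",
            },
            "confidence": 1.0,
            "technique": "Regular expression",
        }
    ],
    "has_build_file": [
        {
            "result": {"value": "<url to Dockerfile>", "format": "dockerfile", "type": "Url"},
            "confidence": 1,
            "technique": "File exploration",
        }
    ],
    "somef_provenance": {"somef_version": "0.9.1"},
}


def results_with_format(output: dict, fmt: str) -> list:
    """Collect (category, value) pairs whose result declares the given format."""
    matches = []
    for category, entries in output.items():
        if not isinstance(entries, list):
            continue  # somef_provenance is an object, not a list of entries
        for entry in entries:
            if isinstance(entry, dict) and entry.get("result", {}).get("format") == fmt:
                matches.append((category, entry["result"]["value"]))
    return matches


bibtex = results_with_format(output, "bibtex")
print(bibtex)
```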

Technique

The techniques can be of several types:

  • header_analysis: the result was extracted by analyzing the headers used in the README file and assessing their proximity to commonly used headers (and other synonyms).
  • supervised_classification: the results were obtained after running text classifiers trained for detecting that type of header.
  • file_exploration: the result comes from an exploration of the files in the repository
  • GitHub_API: the result was obtained from the GitHub API.
  • GitLab_API: the result was obtained from the GitLab API.
  • regular_expression: the result was obtained by applying regular expressions to the files in the repository.
  • code_parser: the result was obtained from code configuration files with metadata markup.
  • software_type_heuristics: the result was obtained from analysis of the repository based on various heuristics from the README, code and extension analysis.

Missing categories

If SOMEF is run with the -m flag, a report of the categories that the program was not able to find is returned. The format for this field is slightly different than the rest, providing a list of the missing categories. An example can be seen below:

"somef_missing_categories": [
  "description", 
  "citation"
]

In this case, SOMEF was not able to find a description or a citation in the target repository. Missing categories will not be added in Codemeta and Turtle exports. Note that the prefix somef is added in the field, to indicate that this is a special type of category.
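
Because special fields share the somef prefix, a consumer can separate them from regular categories by name. The sketch below shows one way to do this; the sample output and the helper `split_output` are hypothetical.

```python
# Hypothetical output produced with the missing-categories report enabled.
output = {
    "name": [
        {
            "result": {"value": "mapeathor", "type": "String"},
            "confidence": 1,
            "technique": "GitHub API",
        }
    ],
    "somef_missing_categories": ["description", "citation"],
    "somef_provenance": {"somef_version": "0.9.1"},
}


def split_output(output: dict):
    """Separate regular categories from the list of missing categories."""
    found = {
        key: value for key, value in output.items() if not key.startswith("somef_")
    }
    missing = output.get("somef_missing_categories", [])
    return found, missing


found, missing = split_output(output)
print(sorted(found), missing)
```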

Turtle format

RDF represents information in triples (subject, predicate and object), where the subject is the entity to be described (in this case a code repository), the predicate is the property describing the subject (e.g., description) and the object is the value obtained (which can be an object or a literal). A triple has no place for information about the statement itself, so representing the provenance information returned by SOMEF is challenging and would require adopting reification mechanisms that make the output more complex. We therefore simplify the output by providing our best guess for each of the extracted fields. This is done by analyzing the results, comparing those that may be redundant or come from the same files, and removing those with low confidence or included in other fields.

Our RDF representation uses the Software Description Ontology. The result, provenance and confidence fields are omitted in this representation (every category with confidence above the threshold specified when running SOMEF will be included in the results).
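
The thresholding step can be pictured with the minimal sketch below, which keeps only the entries whose confidence is above a given threshold and drops the special somef fields. This is an illustration of the idea, not SOMEF's actual export code; the function `above_threshold`, the sample data and the threshold values are assumptions.

```python
# Hypothetical JSON output with entries at different confidence levels.
output = {
    "description": [
        {"result": {"value": "KGTK is a Python library ...", "type": "text"},
         "confidence": 0.83, "technique": "Supervised classification"},
        {"result": {"value": "Python library for large KG manipulation", "type": "text"},
         "confidence": 1, "technique": "GitHub API"},
    ],
    "usage": [
        {"result": {"value": "...", "type": "Text_excerpt"},
         "confidence": 0.5, "technique": "Supervised classification"},
    ],
    "somef_provenance": {"somef_version": "0.9.1"},
}


def above_threshold(output: dict, threshold: float) -> dict:
    """Keep entries with confidence strictly above the threshold."""
    kept = {}
    for category, entries in output.items():
        if category.startswith("somef_"):
            continue  # special fields are not exported as categories
        selected = [e for e in entries if e["confidence"] > threshold]
        if selected:
            kept[category] = selected
    return kept


kept = above_threshold(output, 0.8)
print(list(kept))
```

Whether a value exactly equal to the threshold is kept is a choice made here for the sketch; the text above only says "above".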

Below you can see an example of a software component represented with the sd ontology:

@prefix ns1: <https://w3id.org/okn/o/sd#> .
@prefix ns2: <https://schema.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

## SOFTWARE METADATA

<https://w3id.org/okn/i/Software/mapeathor> a ns1:Software ;
    ns2:license <https://w3id.org/okn/i/License/mapeathor> ;
    ns1:contactDetails """- [Ana Iglesias-Molina](https://github.com/anaigmo) - [ana.iglesiasm@upm.es](mailto:ana.iglesiasm@upm.es)...
""" ;
    ns1:dateCreated "2019-06-09T19:45:24+00:00"^^xsd:dateTime ;
    ns1:dateModified "2022-11-03T15:19:23+00:00"^^xsd:dateTime ;
    ns1:description "Translator of spreadsheet mappings into R2RML, RML or YARRRML";
    ns1:hasBuildFile "https://raw.githubusercontent.com/oeg-upm/mapeathor/master/Dockerfile"^^xsd:anyURI ;
    ns1:hasDocumentation "https://github.com/oeg-upm/Mapeathor/wiki"^^xsd:anyURI ;
    ns1:hasDownloadUrl "https://github.com/oeg-upm/mapeathor/releases"^^xsd:anyURI ;
    ns1:hasExecutableInstructions """The easiest way of running Mapeathor is using the [web service](https://morph.oeg.fi.upm.es/demo/mapeathor) and the [Swagger](https://morph.oeg.fi.upm.es/tool/mapeathor/swagger/) instance. For CLI lovers, the service is available as a [PyPi package](https://pypi.org/project/mapeathor/) and Docker image. The instructions of the latest can be found in the [wiki](https://github.com/oeg-upm/Mapeathor/wiki).
""" ;
    ns1:hasLongName "Mapeathor" ;
    ns1:hasSourceCode <https://w3id.org/okn/i/SoftwareSource/mapeathor> ;
    ns1:hasUsageNotes """##Example
A more detailed explanation is provided in the [wiki](https://github.com/oeg-upm/Mapeathor/wiki)
""" ;
    ns1:hasVersion <https://w3id.org/okn/i/Release/21580066>;
    ns1:identifier "https://doi.org/10.5281/zenodo.5973906"^^xsd:anyURI ;
    ns1:issueTracker "https://api.github.com/repos/oeg-upm/mapeathor/issues"^^xsd:anyURI ;
    ns1:keywords "data-integration, knowledge-graph, r2rml, rml" ;
    ns1:name "oeg-upm/mapeathor" ;
    ns1:readme "https://raw.githubusercontent.com/oeg-upm/mapeathor/master/README.md"^^xsd:anyURI .

## LICENSE INFORMATION

<https://w3id.org/okn/i/License/mapeathor> a ns2:CreativeWork ;
    owl:sameAs <https://spdx.org/licenses/Apache-2.0> ;
    ns1:name "Apache License 2.0" ;
    ns1:url "https://raw.githubusercontent.com/oeg-upm/mapeathor/master/LICENSE"^^xsd:anyURI .

## INFORMATION ON RELEASES

<https://w3id.org/okn/i/Release/21580066> a ns1:SoftwareVersion ;
    ns1:author <https://w3id.org/okn/i/Agent/anaigmo> ;
    ns1:dateCreated "2019-11-08T15:24:55+00:00"^^xsd:dateTime ;
    ns1:datePublished "2019-11-19T10:26:47+00:00"^^xsd:dateTime ;
    ns1:downloadUrl "https://api.github.com/repos/oeg-upm/mapeathor/tarball/v1.0"^^xsd:anyURI,
        "https://api.github.com/repos/oeg-upm/mapeathor/zipball/v1.0"^^xsd:anyURI,
        "https://github.com/oeg-upm/mapeathor/releases/tag/v1.0"^^xsd:anyURI ;
    ns1:hasVersionId "v1.0" ;
    ns1:name "First template" ;
    ns1:url "https://api.github.com/repos/oeg-upm/mapeathor/releases/21580066"^^xsd:anyURI .

## INFORMATION ON SOURCE CODE
<https://w3id.org/okn/i/SoftwareSource/mapeathor> a ns2:SoftwareSourceCode ;
    ns1:codeRepository "https://github.com/oeg-upm/mapeathor"^^xsd:anyURI ;
    ns1:name "oeg-upm/mapeathor" ;
    ns1:programmingLanguage "Dockerfile",
        "Python" .

## INFORMATION ON AUTHORS
<https://w3id.org/okn/i/Agent/anaigmo> a ns2:Person ;
    ns2:name "anaigmo" .

As shown in the Turtle snippet above, SOMEF represents the software as an entity, its relationship with each release (software version), the license found in the repository and the Person who owns it.

Codemeta format

This format is a JSON-LD representation following the Codemeta specification (which itself extends Schema.org). The result, provenance and confidence fields are omitted in this representation (every category with confidence above the threshold specified when running SOMEF will be included in the results). In addition, any metadata category that is not defined in Codemeta is left out.