Output
SOMEF supports three main output formats. Each of them contains different information with different levels of granularity. Below we enumerate them from more granular to less granular:
JSON format
Version: 1.0.1
Default SOMEF response, and the most complete in terms of metadata. The JSON format returns a set of categories, as shown in the snippet below:
{
  "<categoryName>": [
    {
      ...
    }
  ],
  "<categoryName2>": [
    {
      ...
    }
  ],
  "somef_provenance": {
    "date": "2022-05-20 12:00:00",
    "somef_version": "0.9.1",
    "somef_schema_version": "1.0.0"
  }
}
In the snippet, each <categoryName> corresponds to one of the categories SOMEF was able to find. An additional JSON field called somef_provenance returns provenance information about the SOMEF execution. The somef_provenance field always has the same three properties, as shown in the table below:
| Property | Mandatory? | Expected value | Definition |
|---|---|---|---|
| date | Yes | Date | Date when the extraction was performed. Knowing the date is critical, as a repository may change its README file. |
| somef_version | Yes | String | Version of SOMEF used to extract metadata from a code repository. |
| somef_schema_version | Yes | String | Version of SOMEF schema used to represent the JSON output format. |
Info
If a property is mandatory then it will always be returned in the output JSON.
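As a minimal sketch of how a consumer might read this field, the Python snippet below parses a SOMEF-style JSON document and checks that the mandatory provenance properties listed in the table above are present. The sample document and the helper name `read_provenance` are illustrative assumptions, not part of SOMEF, and the date layout is inferred from the snippet above.

```python
import json
from datetime import datetime

# Illustrative output, trimmed to the provenance field shown above.
SAMPLE = """
{
  "somef_provenance": {
    "date": "2022-05-20 12:00:00",
    "somef_version": "0.9.1",
    "somef_schema_version": "1.0.0"
  }
}
"""

MANDATORY = ("date", "somef_version", "somef_schema_version")


def read_provenance(document: dict) -> dict:
    """Return the provenance block, failing loudly if a mandatory property is missing."""
    provenance = document["somef_provenance"]
    missing = [key for key in MANDATORY if key not in provenance]
    if missing:
        raise KeyError(f"missing provenance properties: {missing}")
    return provenance


provenance = read_provenance(json.loads(SAMPLE))
# Assumption: dates use the "YYYY-MM-DD HH:MM:SS" layout seen in the snippet above.
extracted_on = datetime.strptime(provenance["date"], "%Y-%m-%d %H:%M:%S")
print(provenance["somef_version"], extracted_on.date())
```

Keeping the extraction date next to the metadata makes it possible to tell whether a stored result predates a later change to the README.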
Category
Each extracted metadata category is returned as a list with one entry per result SOMEF found for that category when exploring a code repository. For example, the snippet below shows a repository with two descriptions (a short one extracted from the GitHub API and a longer one extracted from the README file). SOMEF aims to return both of them:
"description": [
  {
    "result": {
      "value": "KGTK is a Python library ...",
      "type": "text"
    },
    "confidence": 0.8294290479925978,
    "technique": "Supervised classification",
    "source": "<url to readme file>"
  },
  {
    "result": {
      "value": "Python library for large KG manipulation",
      "type": "text"
    },
    "confidence": 1,
    "technique": "GitHub API"
  }
]
For each element of the list, SOMEF returns a result object, together with its confidence value, the technique used in the extraction and the source file where it's coming from (in case there is one). For example, in the snippet above, SOMEF extracted a description from the README file of the repository using supervised classification, and a short description using the GitHub API.
The confidence depends on the technique used. In this case, the confidence is driven by the classifier which makes the prediction. For the GitHub API the confidence is higher, as it was a description added manually by the authors.
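Because each category is a list of entries with their own confidence, a consumer often needs to pick one. The sketch below, which reuses the two description entries from the example above, selects the entry with the highest confidence. The helper name `best_entry` is a hypothetical choice for illustration; SOMEF itself returns all entries and does not rank them.

```python
from typing import Optional

# Entries copied from the description example above.
output = {
    "description": [
        {
            "result": {"value": "KGTK is a Python library ...", "type": "text"},
            "confidence": 0.8294290479925978,
            "technique": "Supervised classification",
            "source": "<url to readme file>",
        },
        {
            "result": {"value": "Python library for large KG manipulation", "type": "text"},
            "confidence": 1,
            "technique": "GitHub API",
        },
    ]
}


def best_entry(document: dict, category: str) -> Optional[dict]:
    """Return the entry with the highest confidence, or None if the category is absent."""
    entries = document.get(category, [])
    return max(entries, key=lambda entry: entry["confidence"], default=None)


best = best_entry(output, "description")
print(best["result"]["value"], "via", best["technique"])
```

Note that the optional source property is absent from the GitHub API entry, so code that reads it should use `entry.get("source")` rather than indexing.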
SOMEF aims to recognize the following categories (in alphabetical order):
- acknowledgement: Any text that the authors have prepared to acknowledge the contribution from others, or project funding.
- application_domain: The application domain of the repository. This may be related to the research area of a software component (e.g., Astrophysics) or the general domain/functionality of the tool (e.g., machine learning projects). See all current recognized application domains here.
- application_type: Software type: Commandline Application, Notebook Application, Ontology, Scientific Workflow. Non-Software types: Static Website, Uncategorized.
- author: Person(s) or organization(s) responsible for the project. This property is also used to indicate the responsible entities of a publication associated with the code repository.
- citation: Software citation (usually in .bib or .cff format). SOMEF extracts and structures the metadata from these files (including authors, titles, and DOIs) instead of just returning a raw string.
- code_of_conduct: Link to the code of conduct file of the project.
- code_repository: Link to the source code (typically the repository where the readme can be found).
- contact: Contact person responsible for maintaining a software component.
- continuous_integration: Link to continuous integration service, supported on GitHub as well as in GitLab.
- contributing guidelines: Guidelines indicating how to contribute to a software component.
- contributor: Contributors to this software. Note: Contributor metadata is exported from metadata files (e.g., CodeMeta, CONTRIBUTORS, etc.), not from git logs.
- copyright_holder: Entity or individual owning the rights to the software. The year is also extracted, if available.
- date_created: Date when the software component was created.
- date_updated: Date when the software component was last updated (note that this will always be older than the date of the extraction).
- description: A description of what the software component does.
- development_status: The project's development stage: beta, deprecated...
- documentation: Where to find additional documentation about a software component.
- download_url: URL where to download the target software (typically the installer, package or a tarball to a stable version).
- executable_example: Jupyter notebooks ready for execution (e.g., through myBinder, colab or files).
- faq: Frequently asked questions about a software component.
- forks_count: Number of forks of the project at the time of the extraction.
- forks_url: Links to forks made of the project (GitHub only).
- full_name: Name + owner (owner/name) (if available).
- full_title: If the repository has a short name, we will attempt to extract the longer version of the repository name. For example, a repository may be called "Widoco", but the longer title is "Wizard for documenting ontologies".
- funding: Funding code for the related project. Currently, this information is only extracted from existing codemeta.json files within the repository.
- has_build_file: Build file to create a Docker image for the target software.
- has_package_file: Specifies what package file is present in the code repository.
- has_script_file: Snippets of code contained in the repository.
- homepage: URL of the item.
- identifier: Identifiers detected within a repository (e.g., Digital Object Identifier).
- images: Images used to illustrate the software component.
- installation: A set of instructions that indicate how to install a target repository.
- invocation: Execution command(s) needed to run a scientific software component.
- issue_tracker: Link where to open issues for the target repository.
- keywords: Set of terms used to commonly identify a software component.
- license: License and usage terms of a software component.
- logo: Main logo used to represent the target software component.
- maintainer: Individuals or teams responsible for maintaining the software component, extracted from the CODEOWNERS file.
- name: Name identifying a software component.
- ontologies: URL and path to the ontology files present in the repository.
- owner: Name of the user or organization in charge of the repository.
- package_distribution: Link to official package repositories where the software can be downloaded from (e.g., pypi).
- package_file: Link to a package file used in the repository (e.g., pyproject.toml, setup.py).
- package_id: Identifier extracted from packages (e.g., packages.json).
- programming_languages: Languages used in the repository.
- readme_url: URL to the main README file in the repository.
- related_papers: URL to possible related papers within the repository stated within the readme file.
- releases: Pointer to the available versions of a software component.
- repository_status: Repository status as it is described in repostatus.org.
- requirements: Pre-requisites and dependencies needed to execute a software component.
- run: Running instructions of a software component. It may be wider than the invocation category, as it may include several steps and explanations.
- runtime_platform: Specifies the runtime environment or script interpreter dependencies required to run the project (e.g., Python, Java, Julia).
- stargazers_count: Total number of stargazers of the project.
- support: Guidelines and links of where to obtain support for a software component.
- support_channels: Help channels one can use to get support about the target software component.
- usage: Usage examples and considerations of a code repository.
- workflows: URL and path to the computational workflow files present in the repository.
The following table summarizes the properties used to describe a category:
| Property | Mandatory? | Expected value | Definition |
|---|---|---|---|
| confidence | Yes | Number | Value ranging from 0 (very low) to 1 (very high) indicating the confidence of the program in the quality of the extraction. |
| result | Yes | Result | Result obtained from the extraction in a code repository |
| source | No | Url | URL of the source file used for the extraction. |
| technique | Yes | String | Technique used for the extraction. One of the following: Supervised classification, header analysis, regular expression, GitHub API, File exploration, Code parsing |
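The table above can be turned into a small sanity check for consumers of the JSON output. The sketch below is an assumption about how one might validate an entry, not a validator shipped with SOMEF; the function name `check_entry` is hypothetical. It reports missing mandatory properties and confidence values outside the 0 to 1 range.

```python
# Mandatory properties taken from the table above; "source" is optional.
MANDATORY_PROPERTIES = ("confidence", "result", "technique")


def check_entry(entry: dict) -> list:
    """Return a list of human-readable problems; an empty list means the entry looks fine."""
    problems = [f"missing '{name}'" for name in MANDATORY_PROPERTIES if name not in entry]
    confidence = entry.get("confidence")
    if isinstance(confidence, (int, float)) and not 0 <= confidence <= 1:
        problems.append(f"confidence {confidence} is outside [0, 1]")
    return problems


valid = {
    "result": {"value": "Python library for large KG manipulation", "type": "text"},
    "confidence": 1,
    "technique": "GitHub API",
}
print(check_entry(valid))
print(check_entry({"confidence": 1.5}))
```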
Result
Field returning the extracted output from the code repository. An example can be seen below for a citation found in BibTeX format in a README file of a code repository:
"citation": [
  {
    "result": {
      "value": "@inproceedings{ilievski2020kgtk,\n title={{KGTK}: A Toolkit for Large Knowledge Graph Manipulation and Analysis}},\n author={Ilievski, Filip and Garijo, Daniel and Chalupsky, Hans and Divvala, Naren Teja and Yao, Yixiang and Rogers, Craig and Li, Ronpeng and Liu, Jun and Singh, Amandeep and Schwabe, Daniel and Szekely, Pedro},\n booktitle={International Semantic Web Conference},\n pages={278--293},\n year={2020},\n organization={Springer}\n url={https://arxiv.org/pdf/2006.00088.pdf}\n}",
      "format": "bibtex",
      "type": "string",
      "title": "{KGTK}: A Toolkit for Large Knowledge Graph Manipulation and Analysis",
      "url": "https://arxiv.org/pdf/2006.00088.pdf",
      "original_header": "citation"
    },
    "confidence": 1.0,
    "technique": "Regular expression",
    "source": "<url to README file>"
  }
]
A result may have the following fields:
| Property | Mandatory? | Expected value | Definition |
|---|---|---|---|
| format | No | String | Format in which the value is returned. For example, it may be a Dockerfile, a Jupyter notebook, or a citation in BibTeX. |
| type | Yes | String | Text indicating the value type of the result. In some cases it refers to the type of literal being returned, while in others it refers to the type of the object. For example, a license may be detected as a URL, as a text excerpt detected from a file, or as an object with both name and url. |
| value | Yes | String, Number, Date or Url | Text with the result of the extraction performed by SOMEF. The value is always a single object, not a list. |
Depending on the type of the result, additional properties may be found.
The following object types are currently supported:
- Release: Software releases of the current code repository, as available from GitHub.
- Programming_language: Programming language used in the repository.
- License: Object representing all the metadata SOMEF extracts from a license.
- Agent: User (typically, a person) or organization responsible for authoring a software release or a paper.
- ScholarlyArticle: Scientific paper or article associated with the code repository.
- SoftwareApplication: Class to represent the main software component metadata.
- SoftwareDependency: Class to represent software dependencies and runtime platforms required to run the project.
The following literal types are currently supported:
- Number: A numerical value. We do not distinguish between integer, long or float.
- Date: Dates in xsd:date format.
- String: Any representation in text that is not considered a number, date or url. There are two special types of strings:
  - Text_excerpt: The value is a string that has been extracted from a file.
  - File_dump: The value is a string with the contents of a file (e.g., a citation.cff file, or a license.md file).
- Url: Uniform resource locator of a file.
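The distinction between object and literal types matters when processing a result: objects carry extra properties, while literals are fully described by value. The sketch below is a hypothetical helper (`kind_of` is not a SOMEF function) that classifies a result by its type. It assumes the type strings appear exactly as spelled in the two lists above; the examples earlier on this page use lowercase values such as text, so a real consumer may need to normalize case or extend the sets.

```python
# Type names copied from the two lists above.
OBJECT_TYPES = {
    "Release", "Programming_language", "License", "Agent",
    "ScholarlyArticle", "SoftwareApplication", "SoftwareDependency",
}
LITERAL_TYPES = {"Number", "Date", "String", "Text_excerpt", "File_dump", "Url"}


def kind_of(result: dict) -> str:
    """Classify a result as 'object', 'literal' or 'unknown' based on its type property."""
    type_name = result.get("type")
    if type_name in OBJECT_TYPES:
        return "object"
    if type_name in LITERAL_TYPES:
        return "literal"
    return "unknown"


print(kind_of({"type": "License", "value": "..."}))
print(kind_of({"type": "Url", "value": "..."}))
```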
The tables below summarize all types and their corresponding properties. Object types are aligned with the Schema.org and CodeMeta vocabularies.
An Agent has the following properties:
| Property | Expected value | Definition |
|---|---|---|
| affiliation | String | name of organization or affiliation |
| email | String | Email of an author |
| family_name | String | Last name of an author |
| given_name | String | First name of an author |
| identifier | String | id of an agent |
| name | String | Name used to designate the person or organization |
| role | String | The role of the agent in the development or maintenance of this software component |
| url | Url | Uniform resource locator of the resource |
An Asset has the following properties:
| Property | Expected value | Definition |
|---|---|---|
| content_size | Integer | size of file |
| content_url | String | direct download link for the release file |
| download_count | Integer | Number of downloads |
| encoding_format | String | format of the file |
| name | String | Title or name of the file |
| upload_date | Date | Date of creation of a release |
| url | Url | Uniform resource locator of the resource |
A License has the following properties:
| Property | Expected value | Definition |
|---|---|---|
| identifier | String | id of licence |
| name | String | Title or name of the license |
| spdx_id | String | Spdx id corresponding to this license |
| url | Url | Uniform resource locator of the license |
A Programming_language has the following properties:
| Property | Expected value | Definition |
|---|---|---|
| name | String | Name of the language |
| size | Integer | File size content (bytes) of a code repository using a given programming language |
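Since size is reported in bytes per language, a relative share is easy to derive. The following sketch assumes that each programming_languages entry exposes name and size inside its result object, as the table above suggests; the byte counts are made-up values for illustration and the helper `language_shares` is hypothetical.

```python
def language_shares(entries: list) -> dict:
    """Map each language name to its fraction of the total reported bytes."""
    sizes = {}
    for entry in entries:
        result = entry["result"]
        sizes[result["name"]] = sizes.get(result["name"], 0) + result["size"]
    total = sum(sizes.values())
    if total == 0:
        return {}
    return {name: size / total for name, size in sizes.items()}


# Made-up sizes, for illustration only.
languages = [
    {"result": {"type": "Programming_language", "name": "Python", "size": 7500},
     "confidence": 1, "technique": "GitHub API"},
    {"result": {"type": "Programming_language", "name": "Shell", "size": 2500},
     "confidence": 1, "technique": "GitHub API"},
]
shares = language_shares(languages)
print(shares)
```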
A Release has the following properties:
| Property | Expected value | Definition |
|---|---|---|
| assets | Asset | Files attached to the release |
| author | Agent, Organization | Person or organization responsible for creating an article or a software release. |
| description | String | Descriptive text with the purpose of the release |
| date_created | Date | Date of creation of a release |
| date_published | Date | Date of publication of a release |
| html_url | Url | link to the HTML representation of a release |
| name | String | Title or name used to designate the release, license user or programming language. |
| release_id | String | Id of a software release. |
| tag | String | named version of a release |
| tarball_url | Url | URL to the tar ball file where to download a software release |
| url | Url | Uniform resource locator of the resource |
| zipball_url | Url | URL to the zip file where to download a software release |
A ScholarlyArticle has the following properties:
| Property | Expected value | Definition |
|---|---|---|
| author | List of Agent | List of authors responsible for the publication, providing structured metadata for each |
| date_published | String | Date when the article or citation was officially published. |
| doi | String | Digital Object Identifier (DOI) of the reference, usually returned as a full URL. |
| journal | String | Journal where the publication appeared |
| pages | String | Page range of the publication |
| title | String | Title of reference or citation |
| url | String | Link to reference or citation |
| value | String | Title of reference or citation |
| year | Number | Year of publication |
A SoftwareApplication or SoftwareDependency has the following properties:
| Property | Expected value | Definition |
|---|---|---|
| dependency_type | String | Indicates the scope of the dependency: development, runtime or documentation. |
| dependency_resolver | String | Identifies the ecosystem or package manager that resolves the dependency (e.g., npm, pip, julia, conda). |
| is_preferred_citation | Boolean | Set to True if the authors explicitly state this is the preferred citation. Omitted otherwise. |
| name | String | Name of the software, dependency, or runtime platform (e.g., "pandas", "python"). |
| type | String | The object type: SoftwareApplication (for the main repository) or SoftwareDependency (for requirements and platforms). |
| value | String | A string representation typically combining name and version. |
| version | String | The version or version range of the software/dependency. |
A Text_excerpt has the following properties:
| Property | Expected value | Definition |
|---|---|---|
| original_header | String | If the result value is extracted from a markdown file like a README, the original header of that section is also returned. |
| parent_header | [String] | If the result value is extracted from a markdown file like a README, the parent header(s) of the current section are also returned (in case they exist). |
Format
The following formats for a result value are currently recognized:
- bibtex: Format typically used to document bibliography in LaTeX projects.
- cff: Citation file format, an increasingly popular format for citing software projects.
- jupyter_notebook: Computational notebooks typically used in data science.
- dockerfile: Docker files used to build Docker images.
- docker_compose: Orchestration file used to connect multiple containers.
- readthedocs: Documentation format used by many repositories in order to describe their projects.
- wiki: Documentation format used in GitHub repositories.
- setup.py: Package file format used in Python projects.
- publiccode.yml: Metadata file used to describe public sector software projects.
- pyproject.toml: Package file format used in Python projects.
- pom.xml: Package file used in Java projects.
- package.json: Package file used in JavaScript projects.
- bower.json: Package descriptor used for configuring packages that can be used as a dependency for Bower-managed front-end projects.
- composer.json: Manifest file that serves as the package descriptor used in PHP projects.
- cargo.toml.json: Manifest file that serves as the package descriptor used in Rust projects.
- [name].gemspec: Manifest file that serves as the package descriptor used in Ruby gem projects.
Technique
The techniques can be of several types:
- code_parser: The result was obtained from parsing package files with metadata.
- header_analysis: The result was extracted by analyzing the headers used in the README file and assessing their proximity to commonly used headers (and other synonyms).
- file_exploration: The result comes from an exploration of the files in the repository.
- GitHub_API: The result was obtained from the GitHub API.
- GitLab_API: The result was obtained from the GitLab API.
- regular_expression: The result was obtained after applying regular expressions to the files in the repository.
- software_type_heuristics: The result was obtained from analysis of the repository based on various heuristics from the README, code and extension analysis.
- supervised_classification: The results were obtained after running text classifiers trained for detecting that type of header.
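The technique labels in the JSON examples on this page are written with spaces (for example, Supervised classification), while the list above uses underscores. A consumer that groups results by technique may therefore want to normalize the labels first. The sketch below is one possible approach under that assumption; `normalize_technique` and `technique_counts` are hypothetical helpers, and the lowercase normal form is a choice made here, not a SOMEF convention.

```python
from collections import Counter


def normalize_technique(label: str) -> str:
    """Lowercase a technique label and replace spaces with underscores."""
    return label.strip().lower().replace(" ", "_")


def technique_counts(document: dict) -> Counter:
    """Count how many entries each technique produced across all categories."""
    counts = Counter()
    for entries in document.values():
        if not isinstance(entries, list):
            continue  # skips somef_provenance, which is an object
        for entry in entries:
            # somef_missing_categories holds plain strings, so only dict entries are counted
            if isinstance(entry, dict) and "technique" in entry:
                counts[normalize_technique(entry["technique"])] += 1
    return counts


document = {
    "description": [
        {"result": {"value": "KGTK is a Python library ...", "type": "text"},
         "confidence": 0.8294290479925978, "technique": "Supervised classification"},
        {"result": {"value": "Python library for large KG manipulation", "type": "text"},
         "confidence": 1, "technique": "GitHub API"},
    ],
    "somef_missing_categories": ["citation"],
    "somef_provenance": {"somef_version": "0.9.1"},
}
counts = technique_counts(document)
print(counts)
```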
Missing categories
If SOMEF is run with the -m flag, a report of the categories that the program was not able to find is returned. The format for this field is slightly different than the rest, providing a list of the missing categories. An example can be seen below:
"somef_missing_categories": [
"description",
"citation"
]
In this case, SOMEF was not able to find a description or a citation in the target repository. Missing categories will not be added in Codemeta and Turtle exports (in construction). Note that the prefix somef is added in the field, to indicate that this is a special type of category.
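A downstream check might use this field to verify that a repository provides the metadata a project cares about. The sketch below is illustrative only: `missing_from` is a hypothetical helper, and the fallback for documents produced without the -m flag (treating an absent key as a missing category) is an assumption rather than documented SOMEF behavior.

```python
def missing_from(document: dict, wanted: set) -> set:
    """Return the wanted categories that are missing from a SOMEF JSON document."""
    reported = document.get("somef_missing_categories")
    if reported is None:
        # No report in the document (assumed: -m was not used); check for absent keys instead.
        return {name for name in wanted if name not in document}
    return wanted & set(reported)


with_report = {
    "license": [{"result": {"value": "...", "type": "License"}}],
    "somef_missing_categories": ["description", "citation"],
}
print(missing_from(with_report, {"license", "description"}))
```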
Citation Reconciliation
SOMEF reconciles extracted citations by matching unique identifiers (such as DOIs or URLs) to ensure a single, non-duplicated list of references. During this process, each entry retains its provenance, including the original source and the extraction technique used, ensuring full metadata transparency. Once reconciled, the Codemeta export module distinguishes between scholarly articles and the software itself, mapping them respectively to referencePublication and creditText in the final output.
For a detailed breakdown of how these categories are structured and exported, please refer to the Codemeta format section.
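To make the idea of reconciliation concrete, the sketch below merges citation entries that share a DOI or URL while keeping the source and technique of every occurrence. It is a simplified illustration, not SOMEF's actual implementation: the function names, the normalization rule, the sample entries and the provenance list attached to each merged entry are all assumptions introduced here.

```python
def citation_key(entry: dict):
    """Build a comparison key from the DOI or URL of a citation, if any."""
    result = entry["result"]
    identifier = result.get("doi") or result.get("url")
    if not identifier:
        return None
    return identifier.strip().lower().rstrip("/")


def reconcile(entries: list) -> list:
    """Merge entries with the same key, accumulating where each one came from."""
    merged = {}
    unkeyed = []
    for entry in entries:
        origin = {"source": entry.get("source"), "technique": entry.get("technique")}
        key = citation_key(entry)
        if key is None:
            unkeyed.append({**entry, "provenance": [origin]})
        elif key in merged:
            merged[key]["provenance"].append(origin)
        else:
            merged[key] = {**entry, "provenance": [origin]}
    return list(merged.values()) + unkeyed


# Hypothetical entries: the same paper found in two files, plus one without identifier.
citations = [
    {"result": {"value": "KGTK paper", "url": "https://arxiv.org/pdf/2006.00088.pdf"},
     "technique": "Regular expression", "source": "<url to README file>"},
    {"result": {"value": "KGTK paper", "url": "https://arxiv.org/pdf/2006.00088.pdf/"},
     "technique": "File exploration", "source": "<url to CITATION.cff file>"},
    {"result": {"value": "Untitled note"}, "technique": "Regular expression"},
]
reconciled = reconcile(citations)
print(len(reconciled))
```

Entries without any identifier are kept as they are, since there is nothing reliable to match them on.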
Codemeta format
JSON-LD representation following the Codemeta specification (which itself extends Schema.org). The result, provenance and confidence fields are omitted in this representation (every category with confidence above the threshold specified when running SOMEF will be included in the results). In addition, any metadata category outside of what is defined in Codemeta is left out.
In the CodeMeta export, these categories are mapped to their respective fields, ensuring compliance with scientific metadata standards.
The table below summarizes the mapping between the SOMEF internal JSON structure and the Codemeta/Schema.org output:
| Codemeta / Schema.org Field | SOMEF Category | Description |
|---|---|---|
| author | author | Principal authors |
| buildInstructions | installation / documentation | Installation or build instructions |
| creditText | citation (Software) | Human-readable citation for the software (see note 1) |
| codeRepository | code_repository | Source code repository URL |
| contributor | contributor | Project contributors |
| copyrightHolder | copyright_holder | Entity holding the copyright |
| copyrightYear | copyright_holder | Year of copyright |
| dateCreated | date_created | Creation date |
| dateModified | date_modified | Last modification date |
| datePublished | date_published | Initial publication date |
| description | description | Project description |
| developmentStatus | repository_status | Current development status |
| downloadUrl | download_url | Link to download the software |
| funder | funding | Organization funding the project |
| funding | funding | Funding details/grants |
| identifier | identifier | Unique identifiers (e.g., DOI, SWHID) |
| issueTracker | code_repository / issue_tracker | Issue tracker URL (derived from codeRepository) or issue_tracker |
| keywords | keywords | Project tags or keywords |
| license | license | Software license |
| logo | logo | Project logo URL |
| maintainer | maintainer | Project maintainers |
| name | name | Software name |
| programmingLanguage | programming_languages | Languages used |
| readme | readme_url | README file URL |
| referencePublication | citation (Papers) | |
| releaseNotes | releases | Release notes extracted from each release description |
| runtimePlatform | runtime_platform | |
| softwareRequirements | requirements | Technical dependencies and requirements extracted from config files or text |
| softwareVersion | releases | Current version extracted from version of releases |
| url | homepage | Project homepage URL |
1 Handling Citations: creditText vs. referencePublication
SOMEF reconciles extracted citations by distinguishing between the software itself and its associated scholarly articles. To ensure compliance with the Codemeta specification, the citation category from the internal JSON is processed and mapped to two distinct fields in the output:
- creditText: This field is intended for the software itself, providing a ready-to-use citation string for users. It includes authors, the software title, and a direct link to the repository (added during the export process). SOMEF populates this when it detects software-related citations (e.g., from CITATION.cff or root metadata).
- referencePublication: This field is intended for scholarly articles (including preferred-citations and general papers). These are mapped as ScholarlyArticle objects containing structured metadata such as DOI, journal, and title. SOMEF populates this field when it identifies citations based on specific indicators like DOI, journal, or explicit ScholarlyArticle classification.
This transformation ensures that the final Codemeta JSON-LD remains clean, compliant, and clearly separates the software project from its associated academic literature.
Example:
```
"creditText": [
  "Garijo, D., Mao, A., Dharmala, H., Diwanji, C., Wang, J., et al. (somef: software metadata extraction framework). Available at: https://github.com/KnowledgeCaptureAndDiscovery/somef"
],
```
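The split described in this note can be approximated with a few lines of Python. The sketch below routes a citation to referencePublication when it carries a DOI, a journal, or an explicit ScholarlyArticle type, and to creditText otherwise. It is a rough illustration of the indicators named above and not the export module's real logic; the helper names and the choice to use the result's value as the credit string are assumptions.

```python
def is_scholarly(result: dict) -> bool:
    """Apply the indicators named above: explicit type, DOI, or journal."""
    return (
        result.get("type") == "ScholarlyArticle"
        or bool(result.get("doi"))
        or bool(result.get("journal"))
    )


def split_citations(entries: list) -> dict:
    """Separate paper citations from software citations."""
    papers, credits = [], []
    for entry in entries:
        result = entry["result"]
        if is_scholarly(result):
            papers.append(result)
        else:
            # Assumption: the value already holds a human-readable citation string.
            credits.append(result["value"])
    return {"referencePublication": papers, "creditText": credits}


citations = [
    {"result": {"type": "ScholarlyArticle",
                "title": "{KGTK}: A Toolkit for Large Knowledge Graph Manipulation and Analysis",
                "url": "https://arxiv.org/pdf/2006.00088.pdf"}},
    {"result": {"type": "String",
                "value": "Garijo, D., Mao, A., Dharmala, H., Diwanji, C., Wang, J., et al. "
                         "(somef: software metadata extraction framework). Available at: "
                         "https://github.com/KnowledgeCaptureAndDiscovery/somef"}},
]
exported = split_citations(citations)
print(len(exported["referencePublication"]), len(exported["creditText"]))
```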