diff --git a/sections/2_academic_impact/use_of_code_in_research.qmd b/sections/2_academic_impact/use_of_code_in_research.qmd index 8476516..79638f1 100644 --- a/sections/2_academic_impact/use_of_code_in_research.qmd +++ b/sections/2_academic_impact/use_of_code_in_research.qmd @@ -37,6 +37,14 @@ Sometimes a distinction is made between "reuse" and "use", where "reuse" refers This indicator can be useful to provide a more comprehensive view of the impact of the contributions by researchers. Some researchers might be more involved in publishing, whereas others might be more involved in developing and maintaining research software (and possibly a myriad other activities). +### Connections to Reproducibility Indicators + +This indicator focuses on identifying and measuring the presence and contribution of code or software within research activities, providing insight into how these tools support the research process itself. In contrast, reproducibility-focused indicators such as [Reuse of Code in Research](https://handbook.pathos-project.eu/indicator_templates/quarto/5_reproducibility/reuse_of_code_in_research.html) examine the extent to which code or software is adopted and utilized in subsequent studies, reflecting its broader applicability, reusability and role in reproducibility. Additionally, the [Impact of Open Code in Research](https://handbook.pathos-project.eu/indicator_templates/quarto/5_reproducibility/impact_of_open_code_in_research.html) highlights the value of openly shared code or software in fostering transparency, collaboration, and validation across the scientific community. + # Metrics Most research software is not properly indexed. There are initiatives to have research software properly indexed and identified, such as the [Research Software Directory](https://research-software-directory.org/), but these are far from comprehensive at the moment, and this is the topic of ongoing research [@malviya-thakur_scicat_2023]. Many repositories support uploading research software. For instance, Zenodo currently holds about 116,000 records of research software. However, there are also reports of the absence of support for including research software in repositories [@carlin2023]. @@ -65,8 +73,11 @@ Not all bibliometric databases actively track research software, and therefore n Especially because of the limited explicit references to software, it is important to also explore other possibilities to track the use of code in research. One possibility is to try to extract the mentions of a software package or tool from the full-text. This is done by [@istrate], who have trained a machine learning model to extract references to software from full-text. They rely on the manual annotation of software mentions in PDFs by [@du2021]. The resulting dataset of software mentions is made available publicly [@istrate_cz_2022].
+The SciNoBo toolkit [@gialitsis2022b; @kotitsas2023b] includes a new component, currently under evaluation: an automated tool that leverages Deep Learning and Natural Language Processing techniques to identify code/software mentioned in the text of publications and to extract associated metadata, such as name, version, and license. This tool can also classify whether the code/software has been reused by the authors of the publication. + Although the dataset of software mentions might provide a useful resource, it is a static dataset, and at the moment, there do not yet seem to be initiatives to continuously monitor and scan the full-text of publications. Additionally, its coverage is limited to mostly biomedical literature. For that reason, it might be necessary to run the proposed machine learning algorithm itself. The code is available from . + A common "gold standard" dataset for training software mention extraction from full text is the so-called SoftCite dataset [@howison_softcite_2023]. ## Repository statistics (# Forks/Clones/Stars/Downloads/Views) @@ -77,7 +88,9 @@ There are some clear limitations to this approach. Firstly, not all research sof The most common version control system at the moment is [Git](https://git-scm.com/), which itself is open-source. There are other version control systems, such as Subversion or Mercurial, but these are less popular. The most common platform on which Git repositories are shared is GitHub, which is not open-source itself. There are also other repository platforms, such as [CodeBerg](https://codeberg.org/) (built on [Forgejo](https://forgejo.org/)) and [GitLab](https://gitlab.com/), which are themselves open-source, but they have not yet managed to reach the popularity of GitHub. We therefore limit ourselves to describing GitHub, although we might extend this in the future.
-### Measurement
+To ensure that a repository primarily contains code and not data or datasets, one can consider the following checks:
+
+- Repository labelling: Look for repositories that are explicitly labelled as containing code or software. Many repository owners provide clear labels or descriptions indicating the nature of the content.
+- File extensions: Check for files with common code file extensions, such as .py, .java, or .cpp. These file extensions are commonly used for code files, while data files often have extensions like .csv, .txt, or .xlsx.
+- Repository descriptions and README files: Examine the repository descriptions and README files to gain insights into the content. Authors often provide information about the type of code included, its functionality, and its relevance to the project or software.
+- Documentation: Some repositories include extensive documentation that provides details on the software, its usage, and how to contribute to the project. This indicates a greater likelihood that the repository primarily contains code.
+- Existence of script and source folders: In some cases, the existence of certain directories like '/src' for source files or '/scripts' for scripts can indicate that the repository is primarily for code.
+
+#### Measurement

We propose three concrete metrics based on the GitHub API: the number of forks, the number of stars and the number of downloads of releases.
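As a sketch of what these GitHub API metrics look like in practice, the snippet below parses the relevant fields from trimmed samples of the JSON returned by `GET /repos/{owner}/{repo}` and `GET /repos/{owner}/{repo}/releases`. The field names (`stargazers_count`, `forks_count`, `assets[].download_count`) are the documented GitHub REST API fields; the repository name and numbers are illustrative:

```python
import json

# Trimmed samples of the JSON the GitHub REST API returns; field names are
# real, values are illustrative. A live call would use, for example:
#   requests.get(f"https://api.github.com/repos/{owner}/{repo}").json()
repo = json.loads(
    '{"full_name": "example/repo", "stargazers_count": 12, "forks_count": 4}')
releases = json.loads(
    '[{"tag_name": "v1.0", "assets": [{"download_count": 7}, {"download_count": 3}]}]')

def release_downloads(releases: list) -> int:
    """Sum the download counts over all assets of all releases."""
    return sum(a["download_count"] for r in releases for a in r["assets"])

print(repo["full_name"], repo["stargazers_count"], repo["forks_count"],
      release_downloads(releases))
```

Note that release download counts only cover release assets; clones of the repository itself are reported through the (permission-gated) traffic endpoints.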
There are additional metrics about traffic available from [GitHub API metrics](https://docs.github.com/en/rest/metrics), but these unfortunately require permissions from a specific repository.\ diff --git a/sections/2_academic_impact/use_of_data_in_research.qmd b/sections/2_academic_impact/use_of_data_in_research.qmd index 21f88af..0580fef 100644 --- a/sections/2_academic_impact/use_of_data_in_research.qmd +++ b/sections/2_academic_impact/use_of_data_in_research.qmd @@ -37,6 +37,14 @@ Sometimes a distinction is made between "reuse" and "use", where "reuse" refers Nevertheless, this document attempts to summarize what indicators can be used to approximate data use in research. +### Connections to Reproducibility Indicators + +This indicator focuses on identifying and measuring how data is utilized in research activities, providing insight into its contribution to academic outputs and innovation. In contrast, the [Reuse of Data in Research](https://handbook.pathos-project.eu/indicator_templates/quarto/5_reproducibility/reuse_of_data_in_research.html) examines the extent to which existing datasets are adopted for subsequent studies, emphasizing reusability and reproducibility. Additionally, the [Impact of Open Data in Research](https://handbook.pathos-project.eu/indicator_templates/quarto/5_reproducibility/impact_of_open_data_in_research.html) highlights the broader effects of openly sharing data, fostering transparency, and driving advancements across scientific communities. + # Metrics ## Number (Avg.) of times data is cited/mentioned in publications @@ -61,7 +69,9 @@ Based on the data citation information from data repositories one can compile a [UsageCounts](https://usagecounts.openaire.eu/about) for data use by OpenAIRE aims to monitor and report how often research datasets hosted within OpenAIRE are accessed, downloaded, or used by the scholarly community. The service tracks various metrics related to data use in research, among which are statistics on data views and downloads. -Additionally, the \[``` datastet``](https://github.com/kermitt2/datastet) can be used to find named and implicit research datasets from within the academic literature. DataStet extends from [ ```dataseer-ml`](https://github.com/dataseer/dataseer-ml) to identify implicit and explicit dataset mentions in scientific documents, with DataSeer also contributing back to`datastet\`. It automatically characterizes dataset mentions as used or created in the research work. The identified datasets are classified based on a hierarchy derived from MeSH. It can process various scientific article formats such as PDF, TEI, JATS/NLM, ScholarOne, BMJ, Elsevier staging format, OUP, PNAS, RSC, Sage, Wiley, etc. Docker is recommended to deploy and run the DataStet service. In the aforementioned link instructions are provided for pulling the Docker image and running the service as a container.
+Additionally, the [`datastet`](https://github.com/kermitt2/datastet) tool can be used to find named and implicit research datasets within the academic literature. DataStet extends [`dataseer-ml`](https://github.com/dataseer/dataseer-ml) to identify implicit and explicit dataset mentions in scientific documents, with DataSeer also contributing back to `datastet`. It automatically characterizes dataset mentions as used or created in the research work. The identified datasets are classified based on a hierarchy derived from MeSH. It can process various scientific article formats such as PDF, TEI, JATS/NLM, ScholarOne, BMJ, Elsevier staging format, OUP, PNAS, RSC, Sage, Wiley, etc. Docker is recommended for deploying and running the DataStet service; instructions for pulling the Docker image and running the service as a container are provided at the link above. + +The SciNoBo toolkit [@gialitsis2022; @kotitsas2023] includes a new component, currently under evaluation: an automated tool that leverages Deep Learning and Natural Language Processing techniques to identify datasets mentioned in the text of publications and to extract associated metadata, such as name, version, and license. This tool can also classify whether the dataset has been reused by the authors of the publication. ##### Science resources diff --git a/sections/5_reproducibility/impact_of_open_code_in_research.qmd b/sections/5_reproducibility/impact_of_open_code_in_research.qmd index 1bcf79c..af92594 100644 --- a/sections/5_reproducibility/impact_of_open_code_in_research.qmd +++ b/sections/5_reproducibility/impact_of_open_code_in_research.qmd @@ -34,140 +34,28 @@ The impact of Open Code in research aims to capture the effect of making researc This indicator can be used to assess the level of openness and accessibility of research code within a specific scientific community or field and to identify potential barriers or incentives for the adoption of Open Code practices.
It can also be used to track the reuse and subsequent impact related to reproducibility of Open Code, as well as to evaluate the effectiveness of policies and initiatives promoting Open Code practices. -# Metrics - -## NCI for publications that have introduced Open Code - -This metric calculates the Normalised Citation Impact (NCI) for publications that have introduced Open Code. By introducing Open Code, researchers enable others to scrutinize and build upon their computational methods, thus enhancing the potential for reproducibility and advancement of the field. The NCI metric primarily measures the citation impact of a publication, adjusted for differences in citation practices across scientific fields. However, citation impact can also be an indicator of research quality and reproducibility. Therefore, the NCI for publications that have introduced Open Code can serve as an indicator of both the visibility, influence, and reproducibility of research findings. - -One limitation of this metric is that the use of NCI has been criticized for its potential biases and limitations, such as the inability to fully account for differences in research quality or the influence of non-citation-based impact measures. Therefore, we recommend using this metric in conjunction with other metrics in this document, such as software mentions and citations of the code repository, to obtain a more comprehensive assessment of the impact of Open Code practices on research output. - -### Measurement - -To measure this metric, the process begins with the identification of publications that have introduced Open Code. This is typically achieved by scrutinizing metadata within the code repositories and the publications, such as the repository's unique identifiers or the DOI (Digital Object Identifier). Alternatively, explicit mentions of the code repository, such as GitHub or GitLab URLs, within the publication text can be extracted to verify their openness. 
This can be performed manually or using automated tools. - -Upon identification of the relevant publications, it is crucial to categorize them into their respective disciplines. The assignment of disciplines is typically based on the journal where the paper is published, the author's academic department, or the thematic content of the paper. Several databases provide such categorizations, such as [OpenAIRE](https://explore.openaire.eu/fields-of-science), [Scopus](https://www.scopus.com) and [Web of Science](https://www.webofscience.com/wos/woscc/basic-search). - -Finally, the NCI score for each publication is calculated. The NCI measures the citation impact of a publication relative to the average for similar publications in the same discipline, publication year, and document type. It is computed by dividing the total number of citations the publication receives by the average number of citations received by all similar publications. - -One limitation of this approach is that not all Open Code may be registered in code repositories, making it challenging to identify all relevant publications. Additionally, the accuracy of the NCI score may be affected by the availability and quality of citation data in different scientific fields. Therefore, it is important to carefully consider the potential biases and limitations of the data sources and methodologies used to measure this metric. - -#### Datasources - -##### Scopus - -[Scopus](https://www.scopus.com) is a comprehensive expertly curated abstract and citation database that covers scientific journals, conference proceedings, and books across various disciplines. Scopus provides enriched metadata records of scientific articles, comprehensive author and institution profiles, citation counts, as well as calculation of the articles' NCI score using their API. - -One limitation of Scopus is that the calculation of NCI from Scopus only considers documents that are indexed in the Scopus database. 
This could lead to underestimation or overestimation of the NCI for some publications, depending on how these publications are cited in sources outside the Scopus database. - -#### Existing methodologies - -##### SciNoBo toolkit - -The SciNoBo toolkit [@gialitsis2022; @kotitsas2023] can be used to classify scientific publications into specific fields of science, which can then be used to calculate their NCI score. The tool utilizes the citation-graph of a publication and its references to identify its discipline and assign it to a specific Field-of-Science (FoS) taxonomy. The classification system of publications is based on the structural properties of a publication and its citations and references organized in a multilayer network. - -Furthermore, a new component of the SciNoBo toolkit, currently undergoing evaluation, involves an automated tool that employs Deep Learning and Natural Language Processing techniques to identify code/software mentioned in the text of publications and extract metadata associated with them, such as name, version, license, URLs etc. This tool can also classify whether the code/software has been introduced by the authors of the publication. - -To measure the proposed metric, the tool can be used to identify relevant publications that have introduced code/software in conjunction with code repositories in GitHub, GitLab, or Bitbucket where the code/software is openly located and calculate their NCI score. - -## NCI for publications that have (re)used Open Code - -This metric calculates the Normalised Citation Impact (NCI) for publications that have (re)used Open Code. It is a measure of the citation impact of research publications that have utilized Open Code, adjusted for differences in citation practices across scientific fields. The NCI for publications that have (re)used Open Code can indicate the potential impact of code sharing and reuse practices on the visibility and influence of research findings. 
- -A limitation of this metric is that the use of NCI has been criticized for its potential biases and limitations, such as the inability to fully account for differences in research quality or the influence of non-citation-based impact measures. Therefore, we recommend to use this metric in conjunction with other metrics in this document, such as software mentions and citations of the code repository, to obtain a more comprehensive assessment of the impact of Open Code practices on research output. - -### Measurement - -To measure this metric, the process begins with the identification of publications that have (re)used Open Code. This is achieved by extracting explicit mentions of software/code mentions or code repositories, such as GitHub or GitLab URLs, within the publication text and then verifying their (re)use and openness. This can be performed manually or using automated tools. - -Upon identification of the relevant publications, it is crucial to categorize them into their respective disciplines. The assignment of disciplines is typically based on the journal where the paper is published, the author's academic department, or the thematic content of the paper. Several databases provide such categorizations, such as [OpenAIRE](https://explore.openaire.eu/fields-of-science), [Scopus](https://www.scopus.com) and [Web of Science](https://www.webofscience.com/wos/woscc/basic-search). - -Finally, the NCI (Normalised Citation Impact) score for each publication is calculated. The NCI measures the citation impact of a publication relative to the average for similar publications in the same discipline, publication year, and document type. It is computed by dividing the total number of citations the publication receives by the average number of citations received by all other similar publications. - -One potential limitation of this approach is that not all Open Code may be registered in code repositories, making it challenging to identify all relevant publications. 
Additionally, the accuracy of the NCI score may be affected by the availability and quality of citation data in different scientific fields. Therefore, it is important to carefully consider the potential biases and limitations of the data sources and methodologies used to measure this metric. +### Connections to Academic Indicators -#### Datasources +This indicator examines the broader effects of making code or software openly accessible, focusing on its role in fostering transparency, collaboration, and reproducibility across the scientific community. This builds upon the [Use of Code in Research](https://handbook.pathos-project.eu/indicator_templates/sections/2_academic_impact/use_of_code_in_research.html), which assesses the initial incorporation of code or software into research, and the [Reuse of Code in Research](https://handbook.pathos-project.eu/indicator_templates/sections/5_reproducibility/reuse_of_code_in_research.html), which measures the extent to which existing code or software is adopted in subsequent studies. Together, these indicators provide a comprehensive view of how code or software contributes to research outputs, reusability, reproducibility, and the wider adoption of Open Code practices. -##### Scopus - -[Scopus](https://www.scopus.com) is a comprehensive expertly curated abstract and citation database that covers scientific journals, conference proceedings, and books across various disciplines. Scopus provides enriched metadata records of scientific articles, comprehensive author and institution profiles, citation counts, as well as calculation of the articles' NCI score using their API. +# Metrics -One limitation of Scopus is that the calculation of NCI from Scopus only considers documents that are indexed in the Scopus database. This could lead to underestimation or overestimation of the NCI for some publications, depending on how these publications are cited in sources outside the Scopus database. 
+## NCI for publications that have introduced/reused Open Code -#### Existing methodologies +This metric captures the Normalised Citation Impact (NCI) for publications that have either introduced or reused Open Code. By assessing citation impact, this indicator reflects the visibility and influence of research publications that contribute to or benefit from Open Code practices. Citation-based metrics, including the NCI, are extensively discussed under the academic indicator [Citation Impact](https://handbook.pathos-project.eu/sections/2_academic_impact/citation_impact.html). For general details on the methodology, limitations, and measurement of NCI, refer to the academic indicator and its corresponding metrics. -##### SciNoBo toolkit +In this metric, we focus specifically on publications that have directly contributed to reproducibility by either introducing new Open Code or reusing existing Open Code. The reuse of Open Code can be identified using methodologies and metrics outlined in the academic indicator [Use of Code in Research](https://handbook.pathos-project.eu/sections/2_academic_impact/use_of_code_in_research.html), which provides tools and techniques for tracking code usage in research publications. Additionally, this indicator highlights publications that explicitly document and share new Open Code repositories. -The SciNoBo toolkit [@gialitsis2022; @kotitsas2023] can be used to classify scientific publications into specific fields of science, which can then be used to calculate their NCI score. The tool utilizes the citation-graph of a publication and its references to identify its discipline and assign it to a specific Field-of-Science (FoS) taxonomy. The classification system of publications is based on the structural properties of a publication and its citations and references organized in a multilayer network. 
+To measure the NCI for publications that have introduced Open Code, we identify relevant publications through metadata analysis of code repositories, such as unique identifiers or DOIs associated with repositories like GitHub or GitLab. This process can be supported by automated tools, including the SciNoBo toolkit, which uses Deep Learning and Natural Language Processing (NLP) to extract metadata such as the repository name, version, license, and URLs. These tools enable precise identification of publications introducing Open Code, making it possible to calculate their citation impact. -Furthermore, a new component of the SciNoBo toolkit, currently undergoing evaluation, involves an automated tool that employs Deep Learning and Natural Language Processing techniques to identify code/software mentioned in the text of publications and extract metadata associated with them, such as name, version, license, URLs etc. This tool can also classify whether the code/software has been (re)used by the authors of the publication. +The NCI for publications that have introduced or reused Open Code is particularly relevant for reproducibility because it serves as a proxy for the level of engagement and trust that the broader research community places in these resources. Highly cited publications introducing Open Code often signal that the code has provided novel, generalizable solutions to scientific problems, enabling other researchers to replicate and extend findings. Similarly, publications with high NCI that reuse Open Code indicate that shared computational tools are not only accessible but also integral to advancing research in a transparent and reproducible manner. -To measure the proposed metric, the tool can be used to identify relevant publications that have (re)used code/software in conjunction with code repositories in GitHub, GitLab, or Bitbucket where the code/software is openly located and calculate their NCI score. 
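The NCI computation referred to here reduces to a simple ratio: the publication's citation count divided by the average citation count of comparable publications (same discipline, publication year, and document type). A minimal sketch, using illustrative numbers rather than real citation data:

```python
def normalised_citation_impact(citations: int, baseline: float) -> float:
    """NCI = citations of the publication divided by the average citations
    of similar publications (same field, year, and document type)."""
    if baseline <= 0:
        raise ValueError("baseline average must be positive")
    return citations / baseline

# A publication introducing Open Code with 30 citations, in a stratum whose
# average is 12 citations, has an NCI of 2.5 (2.5x the expected impact).
print(normalised_citation_impact(30, 12.0))  # 2.5
```

The difficulty in practice lies not in this ratio but in obtaining reliable field baselines, which is why databases such as Scopus compute it from their own indexed corpus.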
+By focusing on NCI, we can compare publications across disciplines and timeframes, overcoming disparities in citation practices. This ensures that the contribution of Open Code to reproducibility is evaluated on a level playing field, highlighting those practices and outputs that have the greatest impact. Furthermore, normalized metrics allow us to monitor trends in Open Code practices, assess the effectiveness of reproducibility policies, and identify fields or communities where additional incentives for Open Code adoption might be needed. ## Code downloads/usage counts/stars from repositories -This metric measures the number of times an Open Code repository has been downloaded, used, or favourited, which can indicate the level of interest and impact of the code on the scientific community. - -In terms of reproducibility, high usage counts or stars may indicate that a code/software is well-documented and easy to use. Furthermore, a widely used code/software is more likely to be updated and maintained over time, which can improve its reproducibility. - -However, this metric may have limitations in capturing the impact of code that is not hosted in a public repository or downloaded through other means, such as direct communication between researchers. Additionally, usage counts and stars may not necessarily reflect the quality or impact of the code, and may be influenced by factors such as marketing and social media outreach. Therefore, we recommend using this metric in conjunction with other metrics in this document to obtain a more comprehensive assessment of the impact of Open Code practices on research output. - -### Measurement - -To measure this metric, data can be obtained from code repositories such as GitHub, GitLab, or Bitbucket. The number of downloads, usage counts, and stars can be extracted from the repository metadata. For example, on GitHub, this data is available through the API or by accessing the repository page. 
However, it is important to note that not all repository hosting providers may make this information publicly available, and some may only provide partial or incomplete usage data. - -Additionally, the accuracy of the usage data may be affected by factors such as the frequency of updates, the type of license, and the accessibility of the code to different research communities. - -The data can be computationally obtained using web scraping tools, API queries, or by manually accessing the download/usage count/star data. - -#### Datasources - -##### Github - -[GitHub](https://github.com/) is a web-based platform used for version control and collaborative software development. It allows users to create and host code repositories, including those for Open Source software and datasets. The number of downloads, usage counts, and stars on GitHub can be used as a metric for the impact and popularity of Open Code. - -To measure this metric, we can search for the relevant repositories on GitHub and extract the relevant download, usage, and star data. This data can be accessed via the GitHub API, which provides programmatic access to repository data. The API can be queried using HTTP requests, and the resulting data can be parsed and analysed using programming languages such as Python. - -Following is an API call example for retrieving the stars of the `indicator_handbook` repository for `PathOS-project` from Github. - -``` python -import requests -owner = "PathOS-project" -repo = "indicator_handbook" -url = f"https://api.github.com/repos/{owner}/{repo}/stargazers" -headers = {"Accept": "application/vnd.github.v3.star+json"} - -response = requests.get(url, headers=headers) -stars = len(response.json()) -print(f"The {owner}/{repo} repository has {stars} stars.") -``` - -##### GitLab - -[GitLab](https://about.gitlab.com/) is a web-based Git repository manager that provides source code management, continuous integration and deployment, and more. 
It can be used as a data source for metrics related to the usage of open-source software projects, including the number of downloads, stars, and forks. - -To calculate the metric of code downloads/usage counts/stars from GitLab, we need to identify the relevant repositories and extract the relevant information. The number of downloads can be obtained by looking at the download statistics for a particular release of the repository. The number of stars can be obtained by looking at the number of users who have starred the repository. The number of forks can be obtained by looking at the number of users who have forked the repository. - -To access this information, we can use the GitLab API. - -##### Bitbucket - -[Bitbucket](https://bitbucket.org/) is a web-based Git repository hosting service that allows users to host their code repositories, collaborate with other users and teams, and automate their software development workflows. It can be used as a data source for metrics related to the usage of open-source software projects, including the number of downloads, stars, and forks. - -To calculate the metric of code downloads/usage counts/stars from Bitbucket, we need to identify the relevant repositories and extract the relevant information. The number of downloads can be obtained by looking at the download statistics for a particular release of the repository. The number of stars can be obtained by looking at the number of users who have starred the repository. The number of forks can be obtained by looking at the number of users who have forked the repository. - -To access this information, we can use the Bitbucket API, which provides programmatic access to repository data. The API can be queried using HTTP requests, and the resulting data can be parsed and analysed using programming languages such as Python. - -##### Existing methodologies - -##### Ensuring that repositories contain code - -To ensure that a code repository (i.e. 
GitHub, GitLab, Bitbucket) primarily contains code and not data or datasets, one can consider the following checks: +This metric captures the level of interest and impact of Open Code by measuring repository activity such as downloads, usage counts, and stars. Metrics derived from repository platforms like GitHub, GitLab, or Bitbucket can provide insight into how often code is accessed, favorited, or replicated by other users. These indicators are extensively discussed under the academic indicator [Use of Code in Research](https://handbook.pathos-project.eu/sections/2_academic_impact/use_of_code_in_research.html), particularly in the metric "Repository statistics (# Forks/Clones/Stars/Downloads/Views)." For detailed methodologies and measurement approaches, refer to the academic indicator. -- Repository labelling: Look for repositories that are explicitly labelled as containing code or software. Many repository owners provide clear labels or descriptions indicating the nature of the content. -- File extensions: Check for files with common code file extensions, such as .py, .java, or .cpp. These file extensions are commonly used for code files, while data files often have extensions like .csv, .txt, or .xlsx. -- Repository descriptions and README files: Examine the repository descriptions and README files to gain insights into the content. Authors often provide information about the type of code included, its functionality, and its relevance to the project or software. -- Documentation: Some repositories include extensive documentation that provides details on the software, its usage, and how to contribute to the project. This indicates a greater likelihood that the repository primarily contains code. -- Existence of script and source folders: In some cases, the existence of certain directories like '/src' for source files or '/scripts' for scripts can indicate that the repository is primarily for code. 
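Several of these checks can be partially automated. As a rough sketch (the helper name and the 1,000-byte threshold are illustrative assumptions), GitHub's repository languages endpoint (`GET /repos/{owner}/{repo}/languages`) reports the bytes of detected programming-language content in a repository, which can serve as a first-pass filter for code versus data:

``` python
def is_code_repository(languages: dict, min_code_bytes: int = 1000) -> bool:
    """Heuristic based on GitHub's languages endpoint, which returns bytes
    per detected programming language, e.g. {"Python": 53212, "R": 1204}.
    A repository with little or no detected language content is more
    likely to hold data than code."""
    return sum(languages.values()) >= min_code_bytes

print(is_code_repository({"Python": 53212, "R": 1204}))  # True
print(is_code_repository({}))                            # False
```

Such a heuristic should complement, not replace, the manual checks listed above, since language detection can misclassify notebooks, configuration, or generated files.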
+In the context of reproducibility, this indicator emphasizes the implications of repository usage statistics for research transparency and validation. High download counts, numerous forks, or significant stars can signal that the code is well-documented, functional, and useful for replication and extension of research findings. Such engagement often indicates that the Open Code has met the standards required for reproducibility, including accessibility and usability by other researchers. -By considering these checks, we can ensure that the repository primarily contains code rather than data or datasets. \ No newline at end of file +Furthermore, widespread use of Open Code repositories reflects the extent to which shared computational tools are integrated into the research ecosystem. This integration supports cumulative science, where researchers build on existing work rather than duplicating efforts. By tracking these repository metrics, we can better understand how Open Code practices facilitate reproducibility and identify gaps or barriers preventing broader adoption and reuse. \ No newline at end of file diff --git a/sections/5_reproducibility/impact_of_open_data_in_research.qmd b/sections/5_reproducibility/impact_of_open_data_in_research.qmd index b46e4c3..70f3326 100644 --- a/sections/5_reproducibility/impact_of_open_data_in_research.qmd +++ b/sections/5_reproducibility/impact_of_open_data_in_research.qmd @@ -34,145 +34,31 @@ The impact of Open Data in research aims to capture the effect of making researc The indicator can be used to assess the level of openness and accessibility of research data within a specific scientific community or field, and to identify potential barriers or incentives for the adoption of Open Data practices. -# Metrics - -## NCI for publications that have introduced Open Datasets - -This metric calculates the Normalised Citation Impact (NCI) for publications that have introduced Open Datasets. 
By introducing Open Datasets, researchers enable others to access and verify their findings, thus enhancing the potential for reproducibility. The NCI metric primarily measures the citation impact of a publication, adjusted for differences in citation practices across scientific fields. However, citation impact can also be an indicator of research quality and reproducibility. Therefore, the NCI for publications that have introduced Open Datasets can serve as an indicator of the visibility, influence, and reproducibility of research findings. - -One limitation of this metric is that the use of NCI has been criticized for its potential biases and limitations, such as the inability to fully account for differences in research quality or the influence of non-citation-based impact measures. Therefore, we recommend using this metric in conjunction with other metrics in this document to obtain a more comprehensive assessment of the impact of Open Data practices on research output. - -### Measurement - -To measure this metric, the process begins with the identification of publications that have introduced Open Datasets. This is typically achieved by scrutinizing metadata within the datasets and the publications, such as the DOI (Digital Object Identifier). Alternatively, explicit mentions of the dataset within the publication text can be extracted and their openness verified. This can be performed manually or using automated tools. - -Upon identification of the relevant publications, it's crucial to categorize them into their respective disciplines. The assignment of disciplines is typically based on the journal where the paper is published, the author's academic department, or the thematic content of the paper. Several databases provide such categorizations, such as [OpenAIRE](https://explore.openaire.eu/fields-of-science), [Scopus](https://www.scopus.com) and [Web of Science](https://www.webofscience.com/wos/woscc/basic-search). 
- -Finally, the NCI (Normalised Citation Impact) score for each publication is calculated. The NCI measures the citation impact of a publication relative to the average for similar publications in the same discipline, publication year, and document type. It is computed by dividing the total number of citations the publication receives by the average number of citations received by all other similar publications. - -One potential limitation of this approach is that not all Open Datasets may be registered in data repositories, making it challenging to identify all relevant publications. Additionally, the accuracy of the NCI score may be affected by the availability and quality of citation data in different scientific fields. Therefore, it is important to carefully consider the potential biases and limitations of the data sources and methodologies used to measure this metric. - -#### Datasources - -##### OpenAIRE - -OpenAIRE is a European platform that provides Open Access to research outputs, including publications, datasets, and software. OpenAIRE collects metadata from various data sources, including institutional repositories, data repositories, and publishers. - -For the NCI for publications that have introduced Open Datasets metric, we can use OpenAIRE to identify publications that have introduced Open Datasets. We can search for publications by looking for OpenAIRE records that have a dataset identifier in the references section or by using OpenAIRE's API to search for publications that are linked to a specific dataset. - -One limitation of using OpenAIRE for this metric is that not all Open Datasets may be registered in OpenAIRE, which could lead to underestimation of the number of publications that have introduced Open Datasets. - -##### Scopus - -Scopus is a comprehensive expertly curated abstract and citation database that covers scientific journals, conference proceedings, and books across various disciplines. 
Scopus provides enriched metadata records of scientific articles, comprehensive author and institution profiles, citation counts, as well as calculation of the articles' NCI score using their API. +### Connections to Academic Indicators -One limitation of Scopus is that the calculation of NCI from Scopus only considers documents that are indexed in the Scopus database. This could lead to underestimation or overestimation of the NCI for some publications, depending on how these publications are cited in sources outside the Scopus database. +This indicator examines the broader effects of making research data openly accessible, focusing on how transparency and accessibility enhance reproducibility, collaboration, and innovation within the scientific community. This builds upon the [Use of Data in Research](https://handbook.pathos-project.eu/indicator_templates/sections/2_academic_impact/use_of_data_in_research.html), which evaluates how data is initially utilized within research activities, and the [Reuse of Data in Research](https://handbook.pathos-project.eu/indicator_templates/sections/5_reproducibility/reuse_of_data_in_research.html), which measures the extent to which datasets are adopted in subsequent studies. Together, these indicators provide a comprehensive view of how data contributes to scientific outputs, reusability and reproducibility. -#### Existing methodologies - -##### SciNoBo toolkit - -The SciNoBo toolkit [@gialitsis2022; @kotitsas2023] can be used to classify scientific publications into specific fields of science, which can then be used to calculate their NCI score. The tool utilizes the citation-graph of a publication and its references to identify its discipline and assign it to a specific Field-of-Science (FoS) taxonomy. The classification system of publications is based on the structural properties of a publication and its citations and references organized in a multilayer network. 
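Once a publication has been assigned to a field, the NCI described above reduces to a simple ratio. A minimal sketch with hypothetical citation counts:

``` python
from statistics import mean

def nci(citations: int, comparable_citations: list) -> float:
    """NCI = a publication's citations divided by the mean citations of
    comparable publications (same field, publication year, document type)."""
    return citations / mean(comparable_citations)

# Hypothetical example: a paper with 12 citations in a cohort averaging
# 8 citations has an NCI of 1.5, i.e. 50% above the field average.
print(nci(12, [4, 8, 12]))  # 1.5
```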
- -Furthermore, a new component of the SciNoBo toolkit, currently undergoing evaluation, involves an automated tool that employs Deep Learning and Natural Language Processing techniques to identify datasets mentioned in the text of publications and extract metadata associated with them, such as name, version, license, URLs etc. This tool can also classify whether the dataset has been introduced by the authors of the publication. - -To measure the proposed metric, the tool can be used to identify relevant publications that have introduced datasets and calculate their NCI score. - -## NCI for publications that have (re)used Open Datasets - -This metric calculates the Normalised Citation Impact (NCI) for publications that have (re)used Open Datasets. It is a measure of the citation impact of research publications that have utilized Open Datasets, adjusted for differences in citation practices across scientific fields. The NCI for publications that have (re)used Open Datasets can indicate the potential impact of data sharing and reuse practices on the visibility and influence of research findings. A higher NCI score indicates a greater level of scientific collaboration and data sharing within a specific scientific community or field, suggesting that the availability of Open Datasets can contribute to the impact and recognition of research, thus indirectly indicating its potential for reproducibility. - -A limitation of this metric is that the use of NCI has been criticized for its potential biases and limitations, such as the inability to fully account for differences in research quality or the influence of non-citation-based impact measures. Therefore, we recommend using this metric in conjunction with other metrics in this document to obtain a more comprehensive assessment of the impact of Open Data practices on research output. - -### Measurement - -To measure this metric, the process begins with the identification of publications that have (re)used Open Datasets. 
This is typically achieved by scrutinizing metadata within the datasets and the publications, such as the DOI (Digital Object Identifier). Alternatively, explicit mentions of the dataset within the publication text can be extracted and their openness verified. This can be performed manually or using automated tools. - -Upon identification of the relevant publications, it's crucial to categorize them into their respective disciplines. The assignment of disciplines is typically based on the journal where the paper is published, the author's academic department, or the thematic content of the paper. Several databases provide such categorizations, such as [OpenAIRE](https://explore.openaire.eu/fields-of-science), [Scopus](https://www.scopus.com) and [Web of Science](https://www.webofscience.com/wos/woscc/basic-search). - -Finally, the NCI (Normalised Citation Impact) score for each publication is calculated. The NCI measures the citation impact of a publication relative to the average for similar publications in the same discipline, publication year, and document type. It is computed by dividing the total number of citations the publication receives by the average number of citations received by all other similar publications. - -One potential limitation of this approach is that not all Open Datasets may be registered in data repositories, making it challenging to identify all relevant publications. Additionally, the accuracy of the NCI score may be affected by the availability and quality of citation data in different scientific fields. Therefore, it is important to carefully consider the potential biases and limitations of the data sources and methodologies used to measure this metric. - -#### Datasources - -##### Scopus +# Metrics -[Scopus](https://www.scopus.com) is a comprehensive expertly curated abstract and citation database that covers scientific journals, conference proceedings, and books across various disciplines. 
Scopus provides enriched metadata records of scientific articles, comprehensive author and institution profiles, citation counts, as well as calculation of the articles' NCI score using their API. +## NCI for publications that have introduced/reused Open Data -One limitation of Scopus is that the calculation of NCI from Scopus only considers documents that are indexed in the Scopus database. This could lead to underestimation or overestimation of the NCI for some publications, depending on how these publications are cited in sources outside the Scopus database. +This metric captures the Normalised Citation Impact (NCI) for publications that have either introduced or reused Open Data. By assessing citation impact, this indicator reflects the visibility and influence of research publications that contribute to or benefit from Open Data practices. Citation-based metrics, including the NCI, are extensively discussed under the academic indicator [Citation Impact](https://handbook.pathos-project.eu/sections/2_academic_impact/citation_impact.html). For general details on the methodology, limitations, and measurement of NCI, refer to the academic indicator and its corresponding metrics. -#### Existing methodologies - -##### SciNoBo toolkit +In this metric, we focus specifically on publications that have directly contributed to reproducibility by either introducing new Open Data or reusing existing Open Data. The reuse of Open Data can be identified using methodologies and metrics outlined in the academic indicator [Use of Data in Research](https://handbook.pathos-project.eu/sections/2_academic_impact/use_of_data_in_research.html), which provides tools and techniques for tracking data usage in research publications. Additionally, this indicator highlights publications that explicitly document and share new Open Data repositories. 
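One way to support this identification step is to resolve a linked DOI against the DataCite REST API (`GET https://api.datacite.org/dois/{doi}`) and inspect the declared resource type. The helper below is a sketch operating on the `attributes` object of such a response (the helper name is illustrative):

``` python
def links_to_dataset(attributes: dict) -> bool:
    """DataCite DOI metadata declares a general resource type under
    attributes["types"]["resourceTypeGeneral"]; a value of "Dataset"
    indicates the DOI identifies a dataset rather than, e.g., software
    or text."""
    return attributes.get("types", {}).get("resourceTypeGeneral") == "Dataset"

print(links_to_dataset({"types": {"resourceTypeGeneral": "Dataset"}}))   # True
print(links_to_dataset({"types": {"resourceTypeGeneral": "Software"}}))  # False
```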
-The SciNoBo toolkit [@gialitsis2022; @kotitsas2023] can be used to classify scientific publications into specific fields of science, which can then be used to calculate their NCI score. The tool utilizes the citation-graph of a publication and its references to identify its discipline and assign it to a specific Field-of-Science (FoS) taxonomy. The classification system of publications is based on the structural properties of a publication and its citations and references organized in a multilayer network. +To measure the NCI for publications that have introduced Open Data, we identify relevant publications through metadata analysis of datasets, such as unique identifiers or DOIs associated with repositories like Zenodo, DataCite, or OpenAIRE. This process can be supported by automated tools, including the SciNoBo toolkit, which uses Deep Learning and Natural Language Processing (NLP) to extract metadata such as the dataset name, version, license, and URLs. These tools enable precise identification of publications introducing Open Data, making it possible to calculate their citation impact. - -Furthermore, a new component of the SciNoBo toolkit, currently undergoing evaluation, involves an automated tool that employs Deep Learning and Natural Language Processing techniques to identify datasets mentioned in the text of publications and extract metadata associated with them, such as name, version, license, URLs etc. This tool can also classify whether the dataset has been (re)used by the authors of the publication. +The NCI for publications that have introduced or reused Open Data is particularly relevant for reproducibility because it highlights how data sharing and reuse practices enable verification and extension of scientific findings. Highly cited publications introducing Open Data signal that the data provided novel, generalizable insights into scientific questions, thereby enabling other researchers to replicate and build upon the results. 
Similarly, publications with high NCI that reuse Open Data indicate that openly shared datasets are not only accessible but integral to advancing research transparency and reproducibility. -To measure the proposed metric, the tool can be used to identify relevant publications that have (re)used datasets and calculate their NCI score. +By focusing on NCI, we can compare publications across disciplines and timeframes, overcoming disparities in citation practices. This ensures that the contribution of Open Data to reproducibility is evaluated equitably, identifying impactful practices and outputs. Furthermore, normalized metrics allow us to track trends in Open Data adoption, evaluate the effectiveness of Open Data policies, and identify areas where further incentives for Open Data practices might be beneficial. Such analysis provides critical insights into the evolving role of Open Data in enhancing scientific reliability and collaboration across research communities. ## Dataset downloads/usage counts/stars from repositories -This metric measures the number of downloads, usage counts, or stars (depending on the repository) of a given Open Dataset. It provides an indication of the level of interest and use of the dataset by the scientific community, and can serve as a proxy for the potential impact of the dataset on scientific research. It should be noted that this metric may not capture the full impact of Open Datasets on scientific research, as the number of downloads or usage counts may not necessarily reflect the quality or impact of the research that utilizes the dataset. - -In terms of reproducibility, high usage counts or stars may indicate that a dataset is well-documented and easy to use. Furthermore, a widely used dataset is more likely to be updated and maintained over time, which can improve its reproducibility. 
- -One limitation of this metric is that it only captures usage of Open Datasets from specific repositories and may not reflect usage of the same dataset that is hosted elsewhere. Additionally, differences in repository usage and user behaviour may affect the comparability of download/usage count/star data across repositories. Finally, this metric does not capture non-public uses of Open Datasets, such as internal use within an organization or personal use by researchers, which may also contribute to the impact of Open Datasets on scientific research. - -### Measurement - -To measure this metric, we can use data from various data repositories, such as DataCite and Zenodo, or data from OpenAIRE, which provide download or usage statistics for hosted datasets. We can also use platforms such as GitHub or GitLab, which provide star counts as a measure of user engagement with Open Source code repositories that may include Open Datasets. However, it is important to note that different repositories may provide different types of usage statistics, and these statistics may not be directly comparable across repositories. Additionally, not all repositories may track usage statistics, making it difficult to obtain comprehensive data for all Open Datasets. - -The data can be computationally obtained using web scraping tools, API queries, or by manually accessing the download/usage count/star data for each dataset. - -#### Datasources - -##### DataCite - -DataCite is a global registry of research data repositories and datasets, providing persistent identifiers for research data to ensure that they are discoverable, citable, and reusable. The dataset landing pages on DataCite contain information about the dataset, such as metadata, version history, and download statistics. This information can be used to measure the usage and impact of Open Datasets. 
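These landing-page statistics are also exposed programmatically: a DOI's metadata record from the DataCite REST API includes `viewCount`, `downloadCount`, and `citationCount` attributes, although the counts are only populated for repositories that submit usage reports to DataCite. A standard-library sketch (the helper names are illustrative):

``` python
import json
from urllib.request import urlopen

def extract_usage(record: dict) -> dict:
    """Pull the usage fields out of a parsed DataCite DOI record
    (a JSON:API document with the metadata under data.attributes);
    missing or null counts are reported as 0."""
    attributes = record["data"]["attributes"]
    return {key: attributes.get(key) or 0
            for key in ("viewCount", "downloadCount", "citationCount")}

def datacite_usage(doi: str) -> dict:
    # GET https://api.datacite.org/dois/{doi} returns the DOI's metadata.
    with urlopen(f"https://api.datacite.org/dois/{doi}") as response:
        return extract_usage(json.load(response))
```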
- -To calculate the usage count of a dataset, we can use the "Views" field provided on the dataset landing page on DataCite, which indicates the number of times the landing page has been accessed. To calculate the number of downloads, we can use the "Downloads" field, which indicates the number of times the dataset has been downloaded. The number of stars or likes can be used as a measure of the popularity of the dataset among users. - -##### Zenodo - -[Zenodo](https://zenodo.org/) is a general-purpose open-access repository developed by CERN to store scientific data. It accepts various types of research outputs, including datasets, software, and publications. Zenodo assigns a unique digital object identifier (DOI) to each deposited item, which can be used to track its usage and citations. - -To calculate the metric of dataset views and downloads from Zenodo, we can extract the relevant metadata from the Zenodo API, which provides programmatic access to the repository's contents. The API allows us to retrieve information about a specific item, such as its title, author, publication date, and number of views / downloads. We can then aggregate this data to obtain usage statistics for a particular dataset or set of datasets. - -##### OpenAIRE - -OpenAIRE is a European Open Science platform that provides access to millions of openly available research publications, datasets, software, and other research outputs. OpenAIRE aggregates content from various sources, including institutional and thematic repositories, data archives, and publishers. This platform provides usage statistics for each research output in the form of downloads, views, and citations, which can be used to measure the impact and reuse of research outputs, including Open Datasets. - -To calculate this metric using OpenAIRE, we can retrieve the download and view counts for the relevant Open Datasets, which can be accessed through the OpenAIRE REST API. 
The API returns JSON-formatted metadata for each research output, which includes information such as the title, authors, publication date, download counts, and view counts. The download and view counts can be used to calculate the total number of times the dataset has been accessed or viewed, respectively. - -##### GitHub - -[GitHub](https://github.com/) is a web-based platform used for version control and collaborative software development. It allows users to create and host code repositories, including those for Open Source software and datasets. The number of downloads, usage counts, and stars on GitHub can be used as a metric for the impact and popularity of Open Datasets. - -To measure this metric, we can search for the relevant repositories on GitHub and extract the relevant download, usage, and star data. This data can be accessed via the GitHub API, which provides programmatic access to repository data. The API can be queried using HTTP requests, and the resulting data can be parsed and analysed using programming languages such as Python. - -##### GitLab - -[GitLab](https://about.gitlab.com/) is a web-based Git repository manager that provides source code management, continuous integration and deployment, and more. It can be used as a data source for metrics related to the usage of open-source software projects, including the number of downloads, stars, and forks. - -To calculate the metric of dataset downloads/usage counts/stars from GitLab, we need to identify the relevant repositories and extract the relevant information. The number of downloads can be obtained by looking at the download statistics for a particular release of the repository. The number of stars can be obtained by looking at the number of users who have starred the repository. The number of forks can be obtained by looking at the number of users who have forked the repository. - -To access this information, we can use the GitLab API. 
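As a sketch of this approach using only Python's standard library (the helper names are illustrative), the GitLab projects endpoint accepts a URL-encoded `namespace/project` path and returns `star_count` and `forks_count` fields:

``` python
import json
from urllib.parse import quote
from urllib.request import urlopen

def summarize(project: dict) -> dict:
    """Reduce a GitLab project record to the engagement fields used by
    this metric; star_count and forks_count are standard fields of the
    projects API response."""
    return {"stars": project.get("star_count", 0),
            "forks": project.get("forks_count", 0)}

def gitlab_project_stats(path: str) -> dict:
    # e.g. gitlab_project_stats("inkscape/inkscape")
    url = f"https://gitlab.com/api/v4/projects/{quote(path, safe='')}"
    with urlopen(url) as response:
        return summarize(json.load(response))
```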
- -##### Existing methodologies - -##### Ensuring that repositories contain data - -To ensure that a repository (e.g. GitHub, GitLab) primarily contains research data and not code, we can consider the following methodology: +This metric captures the level of interest and impact of Open Datasets by measuring repository activity such as downloads, usage counts, and stars. Metrics derived from repository platforms like Zenodo, DataCite, or GitHub can provide insight into how often datasets are accessed, favorited, or reused by other users. These indicators are extensively discussed under the academic indicator [Use of Data in Research](https://handbook.pathos-project.eu/sections/2_academic_impact/use_of_data_in_research.html), particularly in the metric "Number (Avg.) of views/clicks/downloads from repository." For detailed methodologies and measurement approaches, refer to the academic indicator. -- Repository labelling: Look for repositories that are explicitly labelled as containing data or datasets. Many repository owners provide clear labels or descriptions indicating the nature of the content. -- File extensions: Check for files with common data file extensions, such as .csv, .txt, or .xlsx. These file extensions are commonly used for data files, while code files often have extensions like .py, .java, or .cpp. -- Repository descriptions and README files: Examine the repository descriptions and README files to gain insights into the content. Authors often provide information about the type of data included and its relevance to research. -- Data availability statements: Some repositories include data availability statements that provide details on where the data supporting the reported results can be found. These statements may include links to publicly archived datasets or references to specific repositories. -- Supplementary materials: In some cases, authors may publish supplementary materials alongside their research articles. 
These materials can include datasets and provide additional information about the data and its relevance to the research. +In the context of reproducibility, this indicator highlights how repository usage statistics reflect the accessibility and usability of Open Datasets. High download counts, frequent views, or significant repository engagement often suggest that datasets are well-documented, standardized, and integral to the reproducibility of scientific findings. These metrics serve as proxies for the broader acceptance and utility of the datasets within the research community. -By considering these checks, we can ensure that the repository primarily contains research data rather than code. +Furthermore, repository metrics provide insights into the integration of Open Data practices within the research ecosystem. Datasets that are widely accessed and reused facilitate cumulative science by enabling researchers to verify existing results, build upon prior work, and avoid redundant data collection efforts. By monitoring repository engagement, we can evaluate the effectiveness of Open Data practices in promoting transparency and reproducibility while identifying gaps in documentation, accessibility, or usability that may hinder broader adoption and reuse of Open Data. ## Downloads / views of published DMPs diff --git a/sections/5_reproducibility/reuse_of_code_in_research.qmd b/sections/5_reproducibility/reuse_of_code_in_research.qmd index d528eb2..8ca739c 100644 --- a/sections/5_reproducibility/reuse_of_code_in_research.qmd +++ b/sections/5_reproducibility/reuse_of_code_in_research.qmd @@ -32,102 +32,18 @@ title: Reuse of code in research The reuse of code or software in research refers to the practice of utilising existing code or software to develop new research tools, methods, or applications. 
It is becoming increasingly important in various scientific fields, including computer science, engineering, and data analysis, because it directly contributes to scientific reproducibility by enabling other researchers to validate the findings without the need to recreate the software or tools from scratch. Additionally, it is an indicator of research quality, as repeated use of code or software often signals robustness and reliability. Furthermore, a high percentage of research projects reusing code within a particular field could be an indication of strong collaboration and trust within the scientific community. This indicator aims to capture the extent to which researchers engage in the reuse of code or software in their research by quantifying the number and proportion of studies that utilise existing code or software. The indicator can be used to assess the level of collaboration and sharing of resources within a specific scientific community or field and to identify potential barriers or incentives for the reuse of code or software in research. Additionally, it can serve as a measure of the quality and reliability of research, as the reuse of code or software can increase the transparency, replicability, and scalability of research findings. -# Metrics - -## Number of code/software reused in publications - -This metric quantifies the number of times existing code or software has been reused in published research articles. A higher number of instances of code or software reuse in publications suggests a strong culture of code and resources dissemination and building upon existing research within a scientific community or field. - -A limitation of this metric is that it may not capture all instances of code or software reuse, as some researchers may reuse code or software without explicitly citing the original source. 
This challenge is further exacerbated by the fact that standards of code/software citation are still relatively poor, making the identification of all instances of code/software reuse across research fields problematic. Additionally, this metric may not account for the quality or appropriateness of the reused code or software for the new research questions. Furthermore, it may be challenging to compare the number of instances of code or software reuse in publications across different fields, as some fields may rely more heavily on developing new code or software rather than reusing existing resources. - -### Measurement - -An initial step to measure the number of reused code/software in publications can be to count the code/software citations linked with each code/software. This basic strategy, despite being prone to some noise, serves as a fundamental measure for this metric. For a more comprehensive and accurate estimate, we can use tools like text mining and machine learning, including Natural Language Processing (NLP) applied to full texts. These tools help us find code or software reuse statements, or directly pull out datasets from a publication and label them as reused. - -However, these methods may face challenges such as inconsistencies in reporting of code or software reuse, variations in the degree of specificity in reporting of the reuse, and difficulties in distinguishing between code or software that is reused versus code or software that is developed anew but shares similarities with existing code or software. Furthermore, the availability and quality of the automated tools may vary across different research fields and may require domain-specific adaptations. - -#### Datasources - -##### OpenAIRE - -[OpenAIRE](https://www.openaire.eu/) is a European Open Science platform that provides access to millions of openly available research publications, datasets, software, and other research outputs. 
OpenAIRE aggregates content from various sources, including institutional and thematic repositories, data archives, and publishers. This platform provides usage statistics for each research output in the form of downloads, views, and citations, which can be used to measure the impact and reuse of research outputs, including code/software. - -To measure the proposed metric, [OpenAIRE Explore](https://explore.openaire.eu/) can be used to find and access Open Software, study their usage statistics, and identify the research publications that reference them. - -However, it's important to note that OpenAIRE Explore does not provide comprehensive data for directly calculating the metric, but rather provides the publication references of each Open Software that need to be analysed. - -##### CZI Software Mentions - -The CZI Software Mentions Dataset [@istrate] is a resource released by the Chan Zuckerberg Initiative (CZI) that provides software mentions extracted from a large corpus of scientific literature. Specifically, the dataset provides access to 67 million software mentions derived from 3.8 million open-access papers in the biomedical literature from PubMed Central and 16 million full-text papers made available to CZI by various publishers. - -A key limitation of this dataset is its focus on biomedical science, meaning it may not provide a comprehensive view of software usage in other scientific disciplines. - -To calculate the proposed metric, one could use the CZI Software Mentions Dataset to identify the frequency and distribution of mentions of specific software tools across different scientific papers. The dataset also contains links to software repositories (like PyPI, CRAN, Bioconductor, SciCrunch, and GitHub) which can be used to gather more metadata about the software tools. 
- -#### Existing methodologies - -##### SciNoBo Toolkit - -The SciNoBo toolkit [@gialitsis2022b; @kotitsas2023b] has a new component, currently undergoing evaluation, which involves an automated tool, leveraging Deep Learning and Natural Language Processing techniques to identify code/software mentioned in the text of publications and extract metadata associated with them, such as name, version, license, etc. This tool can also classify whether the code/software has been reused by the authors of the publication. - -To measure the proposed metric, the tool can be used to identify the reused code/software in the publication texts. - -One limitation of this methodology is that it may not capture all instances of code/software reuse if they are not explicitly mentioned in the text of the publication. Additionally, the machine learning algorithms used by the tool may not always accurately classify whether a code/software has been reused and may require manual validation. - -##### DataSeer.ai +### Connections to Academic Indicators -[DataSeer.ai](https://dataseer.ai/) is a platform that utilizes machine learning and Natural Language Processing (NLP) to facilitate the detection and extraction of datasets, methods, and software mentioned in academic papers. The platform can be used to identify instances of software/code reuse within the text of research articles and extract associated metadata. +This indicator emphasizes the adoption and utilization of existing code or software in subsequent studies, focusing on its role in enhancing reproducibility, collaboration, and research quality. In contrast, the [Use of Code in Research](https://handbook.pathos-project.eu/indicator_templates/sections/2_academic_impact/use_of_code_in_research.html) measures the initial incorporation of code or software into research activities, providing insights into its contribution to the research process itself. 
Furthermore, the [Impact of Open Code in Research](https://handbook.pathos-project.eu/indicator_templates/sections/5_reproducibility/impact_of_open_code_in_research.html) extends this perspective by evaluating the broader effects of making code or software openly accessible, fostering transparency, and driving innovation across the scientific community. -To measure the proposed metric, DataSeer.ai can scan the body of text in research articles and identify instances of code/software reuse. - -However, it is important to note that the ability of DataSeer.ai to determine actual code/software reuse may depend on the explicitness of the authors' writing about their code/software usage, thus not capturing all instances of code/software reuse if they are not explicitly mentioned in the text. Moreover, the machine learning algorithms used by the tool may not always accurately classify whether a code or software has been reused, and may require manual validation. - -## Number (%) of publications with reused code/software - -This metric quantifies the number or percentage of publications that explicitly mention the reuse of existing code or software. It provides an indication of the extent to which researchers are utilizing existing resources to develop new research tools, methods, or applications, within a specific scientific field or task. - -A limitation of this metric is that it may not capture all instances of code or software reuse, as some researchers may reuse code or software without explicitly citing the original source. Additionally, it may not account for the quality or appropriateness of the reused code or software for the new research questions. Furthermore, it may be challenging to compare the number or percentage of publications with reused code or software across different fields, as some fields may rely more heavily on developing new code or software rather than reusing existing resources. 
- -### Measurement - -To measure the number or percentage of publications with reused code or software, automatic text mining and machine learning techniques can be used to search for code or software reuse statements, or to identify reused code or software within published research articles, such as the new component of the SciNoBo toolkit. - -To measure the percentage of publications with reused code/software, we start by using automatic text mining and/or machine learning techniques to identify whether a publication uses/analyses code or software. This involves searching for keywords and phrases associated with the methodologies and use of code or software within the text of the publications. Next, among the identified publications, we search for code or software reuse statements, or directly extract the code/software from the publications and try to classify them as reused, reporting the percentage of those publications. - -However, these methods may face challenges such as inconsistencies in reporting of code or software reuse, variations in the degree of specificity in reporting of the reuse, and difficulties in distinguishing between code or software that is reused versus code or software that is developed anew but shares similarities with existing code or software. Furthermore, the availability and quality of the automated tools may vary across different research fields and may require domain-specific adaptations. - -#### Datasources - -##### OpenAIRE - -[OpenAIRE](https://www.openaire.eu/) is a European Open Science platform that provides access to millions of openly available research publications, datasets, software, and other research outputs. OpenAIRE aggregates content from various sources, including institutional and thematic repositories, data archives, and publishers. 
This platform provides usage statistics for each research output in the form of downloads, views, and citations, which can be used to measure the impact and reuse of research outputs, including code/software. - -To measure the proposed metric, [OpenAIRE Explore](https://explore.openaire.eu/) can be used to find and access Open Software, study their usage statistics, and identify the research publications that reference them. - -However, it's important to note that OpenAIRE Explore does not provide comprehensive data for directly calculating the metric, but rather provides the publication references of each Open Software that need to be analysed. - -##### CZI Software Mentions - -The CZI Software Mentions Dataset [@istrate] is a resource released by the Chan Zuckerberg Initiative (CZI) that provides software mentions extracted from a large corpus of scientific literature. Specifically, the dataset provides access to 67 million software mentions derived from 3.8 million open-access papers in the biomedical literature from PubMed Central and 16 million full-text papers made available to CZI by various publishers. - -A key limitation of this dataset is its focus on biomedical science, meaning it may not provide a comprehensive view of software usage in other scientific disciplines. - -To calculate the proposed metric, one could use the CZI Software Mentions Dataset to identify the frequency and distribution of mentions of specific software tools across different scientific papers. The dataset also contains links to software repositories (like PyPI, CRAN, Bioconductor, SciCrunch, and GitHub) which can be used to gather more metadata about the software tools. 
- -#### Existing methodologies - -##### SciNoBo Toolkit - -The SciNoBo toolkit [@gialitsis2022b; @kotitsas2023b] has a new component, currently undergoing evaluation, which involves an automated tool, leveraging Deep Learning and Natural Language Processing techniques to identify code/software mentioned in the text of publications and extract metadata associated with them, such as name, version, license, etc. This tool can also classify whether the code/software has been reused by the authors of the publication. - -To measure the proposed metric, the tool can be used to identify the reused code/software in the publication texts. +# Metrics -One limitation of this methodology is that it may not capture all instances of code/software reuse if they are not explicitly mentioned in the text of the publication. Additionally, the machine learning algorithms used by the tool may not always accurately classify whether code/software has been reused, and may require manual validation. +## Number of code/software reused in publications -##### DataSeer.ai +This metric quantifies the number of times existing code or software is reused in subsequent studies, reflecting its role in enhancing reproducibility, collaboration, and research quality. The reuse of code in research strengthens reproducibility by allowing other researchers to validate findings and build upon existing methods and tools.
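+The counting behind this metric can be illustrated with a minimal keyword-matching sketch. The publication texts and software names below are hypothetical, and a production pipeline would instead rely on the trained mention-extraction models referenced in the academic indicator:

```python
import re
from collections import Counter

# Hypothetical full texts of publications (illustrative only).
publications = {
    "pub1": "We reused the ExampleTool package (v2.1) for preprocessing.",
    "pub2": "All analyses were performed with custom scripts.",
    "pub3": "Figures were produced with ExampleTool and OtherSoft.",
}

# Hypothetical software names to search for; a real system would use
# trained mention-extraction models and curated software registries.
software_names = ["ExampleTool", "OtherSoft"]

def count_software_mentions(texts, names):
    """Count, per software name, how many publications mention it."""
    counts = Counter()
    for text in texts.values():
        for name in names:
            # Word-boundary match to avoid partial-name false positives.
            if re.search(rf"\b{re.escape(name)}\b", text):
                counts[name] += 1
    return counts

counts = count_software_mentions(publications, software_names)
print(counts)  # Counter({'ExampleTool': 2, 'OtherSoft': 1})
```

+Such keyword matching only approximates the metric: it cannot by itself distinguish reuse from a mere mention, which is why the measurement ultimately defers to dedicated mention-extraction and classification tooling.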
+This closely aligns with the metrics in the [Use of Code in Research](https://handbook.pathos-project.eu/sections/2_academic_impact/use_of_code_in_research.html) under the academic indicators, specifically the number of mentions of code or software in publications. For further details on measurement, including text mining tools and bibliometric databases, refer to the academic indicator. -To measure the proposed metric, DataSeer.ai can scan the body of text in research articles and identify instances of code/software reuse. +In the context of reproducibility, the reuse of code indicates that methods and processes described in research publications are transparent and accessible. When researchers reuse code, they signal that the original research is sufficiently documented and functional to support replication. This is a cornerstone of open science, as reproducible research enables validation of results, ensuring the robustness of scientific knowledge and minimizing errors. The extent of code reuse also highlights the community’s trust in the reliability and quality of the code, as widely adopted software is likely to have undergone rigorous validation by multiple users. -However, it is important to note that DataSeer.ai's ability to determine actual code/software reuse may depend on the explicitness of the authors' writing about their code/software usage, thus not capturing all instances of code/software reuse if they are not explicitly mentioned in the text. Moreover, the machine learning algorithms used by the tool may not always accurately classify whether a code or software has been reused, and may require manual validation. \ No newline at end of file +Furthermore, the act of reusing code fosters interdisciplinary collaboration and accelerates scientific progress. By building on shared resources rather than duplicating efforts, researchers save valuable time and energy.
This collaborative approach to software reuse lets scientific communities focus on advancing new knowledge rather than repeatedly solving the same technical challenges. As a result, code reuse acts as a multiplier for reproducibility, allowing not only the original study but also derivative works to be verified and built upon, expanding the scope of reliable and impactful research. \ No newline at end of file diff --git a/sections/5_reproducibility/reuse_of_data_in_research.qmd b/sections/5_reproducibility/reuse_of_data_in_research.qmd index 7f3084f..8c3ca20 100644 --- a/sections/5_reproducibility/reuse_of_data_in_research.qmd +++ b/sections/5_reproducibility/reuse_of_data_in_research.qmd @@ -32,96 +32,18 @@ title: Reuse of data in research The reuse of data in research refers to the practice of utilizing existing data sets for new research questions. It is a common practice in various scientific fields, and it can lead to increased scientific efficiency, reduced costs, and enhanced scientific collaborations. Additionally, the reuse of well-documented data can serve as an independent verification of original findings, thereby enhancing the reproducibility of research. This indicator aims to capture the extent to which researchers engage in the reuse of data in their research, by quantifying the number and proportion of studies that utilize previously collected data. The indicator can be used to assess the level of scientific collaboration and sharing of data within a specific scientific community or field, and to identify potential barriers or incentives for the reuse of data in research. Additionally, it can serve as a measure of the quality and reliability of research, as the reuse of data can increase the transparency, validity, and replicability of research findings. -# Metrics - -## Number of datasets reused in publications - -This metric quantifies the number of datasets that have been reused in published research articles.
A higher number of datasets reused in publications suggests a strong culture of data dissemination and building upon existing research within a scientific community or field. - -A limitation of this metric is that it may not capture all instances of data reuse, as some researchers may reuse data sets without explicitly citing the original source. This challenge is further exacerbated by the fact that standards of data citation are still relatively poor, making the identification of all instances of data reuse across research fields problematic. Additionally, this metric may not account for the quality or appropriateness of the reused data sets for the new research questions. Furthermore, it may be challenging to compare the number of datasets reused in publications across different fields, as some fields may rely more heavily on new data collection rather than data reuse. - -### Measurement - -An initial step to measure the number of reused datasets in publications can be to count the data citations linked with each dataset. This basic strategy, despite being prone to some noise, serves as a fundamental measure for this metric. For a more comprehensive and accurate estimate, we can use tools like text mining and machine learning, including Natural Language Processing (NLP) applied to full texts. These tools help us find data reuse statements, data availability statements, or directly pull out datasets from a publication and label them as reused. - -However, these methods may face challenges such as inconsistencies in reporting of reused data, and variations in the degree of specificity in the reporting of the reuse. Additionally, the availability and quality of the data extraction tools may vary across different research fields and may require domain-specific adaptations. 
- -#### Datasources - -##### OpenAIRE - -OpenAIRE is a European Open Science platform that provides access to millions of openly available research publications, datasets, software, and other research outputs. OpenAIRE aggregates content from various sources, including institutional and thematic repositories, data archives, and publishers. This platform provides usage statistics for each research output in the form of downloads, views, and citations, which can be used to measure the impact and reuse of research outputs, including Open Datasets. - -To measure the proposed metric, [OpenAIRE Explore](https://explore.openaire.eu/) can be used to find and access Open Datasets, study their usage statistics, and identify the research publications that reference them. - -However, it's important to note that OpenAIRE Explore does not provide comprehensive data for directly calculating the metric, but rather provides the publication references of each Open Software that need to be analysed. - -##### DataCite - -[DataCite](https://datacite.org/) is a global registry of research data repositories and datasets, providing persistent identifiers for research data to ensure that they are discoverable, citable, and reusable. The dataset landing pages on DataCite contain information about the dataset, such as metadata, version history, and download statistics. - -To measure the proposed metric, we can employ the DataCite REST API to identify relevant datasets, along with to find their DOI, metadata and usage statistics. - -#### Existing methodologies - -##### SciNoBo Toolkit - -The SciNoBo toolkit [@gialitsis2022; @kotitsas2023] has a new component, currently undergoing evaluation, which involves an automated tool, leveraging Deep Learning and Natural Language Processing techniques to identify datasets mentioned in the text of publications and extract metadata associated with them, such as name, version, license, etc. 
This tool can also classify whether the dataset has been reused by the authors of the publication. - -To measure the proposed metric, the tool can be used to identify the reused datasets in the publication texts. - -One limitation of this methodology is that it may not capture all instances of dataset reuse if they are not explicitly mentioned in the text of the publication. Additionally, the machine learning algorithms used by the tool may not always accurately classify whether a dataset has been reused and may require manual validation. - -##### DataSeer.ai +### Connections to Academic Indicators -[DataSeer.ai](https://dataseer.ai/) is a platform that utilizes machine learning and Natural Language Processing (NLP) to facilitate the detection and extraction of datasets, methods, and software mentioned in academic papers. The platform can be used to identify instances of dataset reuse within the text of research articles and extract associated metadata. +This indicator emphasizes the adoption and utilization of existing datasets for new research purposes, highlighting its role in enhancing reusability, reproducibility, collaboration, and research efficiency. In contrast, the [Use of Data in Research](https://handbook.pathos-project.eu/indicator_templates/sections/2_academic_impact/use_of_data_in_research.html) focuses on the initial incorporation of data into research activities and its contributions to academic outputs. Furthermore, the [Impact of Open Data in Research](https://handbook.pathos-project.eu/indicator_templates/sections/5_reproducibility/impact_of_open_data_in_research.html) extends this perspective by evaluating how openly shared datasets foster transparency, accessibility, and innovation across the scientific community. -To measure the proposed metric, DataSeer.ai can scan the body of text in research articles and identify instances of dataset reuse. 
- -However, it is important to note that DataSeer.ai's ability to determine actual data reuse may depend on the explicitness of the authors' writing about their data usage, thus not capturing all instances of dataset reuse if they are not explicitly mentioned in the text. Moreover, the machine learning algorithms used by the tool may not always accurately classify whether a dataset has been reused, and may require manual validation. - -## Number (%) of publications with reused datasets - -This metric quantifies the number or percentage of publications that explicitly mention the reuse of previously collected datasets. It is a useful metric for assessing the extent to which researchers are engaging in the reuse of data in their research, within a specific scientific field or task. - -A limitation of this metric is that it may not capture all instances of data reuse, as some researchers may reuse data sets without explicitly citing the original source. Additionally, it may not account for the quality or appropriateness of the reused data sets for the new research questions. Furthermore, it may be challenging to compare the number or percentage of publications with reused datasets across different fields, as some fields may rely more heavily on new data collection rather than data reuse. - -### Measurement - -To measure the percentage of publications with reused datasets, we start by using automatic text mining and/or machine learning techniques to identify whether a publication uses/analyses data. This involves searching for keywords and phrases associated with data analysis within the text of the publications. Next, among the identified data-analysing publications, we search for data reuse statements, data availability statements, or directly extract the datasets from the publications and try to classify them as reused, reporting the percentage of those publications. 
- -However, these methods may face challenges such as inconsistencies in reporting of reused data, and variations in the degree of specificity in the reporting of the reuse. Additionally, the availability and quality of the data extraction tools may vary across different research fields and may require domain-specific adaptations. - -#### Datasources - -##### OpenAIRE - -OpenAIRE is a European Open Science platform that provides access to millions of openly available research publications, datasets, software, and other research outputs. OpenAIRE aggregates content from various sources, including institutional and thematic repositories, data archives, and publishers. This platform provides usage statistics for each research output in the form of downloads, views, and citations, which can be used to measure the impact and reuse of research outputs, including Open Datasets. - -To measure the proposed metric, [OpenAIRE Explore](https://explore.openaire.eu/) can be used to find and access Open Datasets, study their usage statistics, and identify the research publications that reference them. - -However, it is important to note that OpenAIRE Explore does not provide comprehensive data for directly calculating the metric, but rather provides the publication references of each Open Software that need to be analysed. - -##### DataCite - -[DataCite](https://datacite.org/) is a global registry of research data repositories and datasets, providing persistent identifiers for research data to ensure that they are discoverable, citable, and reusable. The dataset landing pages on DataCite contain information about the dataset, such as metadata, version history, and download statistics. - -To measure the proposed metric, we can employ the DataCite REST API to identify relevant datasets, along with to find their DOI, metadata and usage statistics. 
- -#### Existing methodologies - -##### SciNoBo Toolkit - -The SciNoBo toolkit [@gialitsis2022; @kotitsas2023] has a new component, currently undergoing evaluation, which involves an automated tool, leveraging Deep Learning and Natural Language Processing techniques to identify datasets mentioned in the text of publications and extract metadata associated with them, such as name, version, license, etc. This tool can also classify whether the dataset has been reused by the authors of the publication. - -To measure the proposed metric, the tool can be used to identify reused datasets in publication texts. +# Metrics -One limitation of this methodology is that it may not capture all instances of dataset reuse if they are not explicitly mentioned in the text of the publication. Additionally, the machine learning algorithms used by the tool may not always accurately classify whether a dataset has been reused and may require manual validation. +## Number of datasets reused in publications -##### DataSeer.ai +This metric quantifies the number of datasets reused in research publications, highlighting their importance in fostering reproducibility, collaboration, and research efficiency. The reuse of datasets serves as an essential mechanism for validating findings and building upon prior research, enhancing the transparency and replicability of scientific outputs. -[DataSeer.ai](https://dataseer.ai/) is a platform that utilizes machine learning and Natural Language Processing (NLP) to facilitate the detection and extraction of datasets, methods, and software mentioned in academic papers. The platform can be used to identify instances of dataset reuse within the text of research articles and extract associated metadata. +This aligns closely with the metrics discussed in the [Use of Data in Research](https://handbook.pathos-project.eu/indicator_templates/sections/2_academic_impact/use_of_data_in_research.html) under the academic indicators.
Specifically, the measurement of dataset mentions or citations in publications provides the foundation for assessing both the use and reuse of data. For further details on measurement methodologies, such as text mining, and on the role of data repositories, refer to the academic indicator. -In the context of reproducibility, the reuse of datasets reflects the scientific community's ability to leverage existing data to answer new research questions. It underscores the importance of effective data sharing practices, robust metadata, and clear licensing, as these enable other researchers to trust, access, and incorporate datasets into their work. Furthermore, higher levels of data reuse often indicate stronger collaboration and trust within a scientific field, which are critical for advancing reproducible research. -However, it is important to note that DataSeer.ai's ability to determine actual data reuse may depend on the explicitness of the authors' writing about their data usage, thus not capturing all instances of dataset reuse if they are not explicitly mentioned in the text. Moreover, the machine learning algorithms used by the tool may not always accurately classify whether a dataset has been reused and may require manual validation. \ No newline at end of file +By interpreting dataset reuse through the lens of reproducibility, this indicator also highlights the extent to which researchers adopt transparent and open research practices. The repeated utilization of datasets helps validate original findings and demonstrates that the data are robust, reliable, and suitable for diverse applications.
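+As a minimal illustration of the counting involved, the sketch below assumes a hypothetical output from a mention-extraction and reuse-classification step (the publication and dataset names are invented); a real measurement would rely on the tooling referenced in the academic indicator:

```python
# Hypothetical mention-extraction output: each record is one dataset
# mention in one publication, flagged as reuse or not by a classifier.
mentions = [
    {"pub": "pub1", "dataset": "SurveyData2020", "reused": True},
    {"pub": "pub2", "dataset": "SurveyData2020", "reused": True},
    {"pub": "pub2", "dataset": "NewFieldData", "reused": False},
    {"pub": "pub3", "dataset": "ClimateRecords", "reused": False},
    {"pub": "pub4", "dataset": "ClimateRecords", "reused": True},
]

def count_reused_datasets(mentions):
    """Count distinct datasets reused by at least one publication."""
    return len({m["dataset"] for m in mentions if m["reused"]})

def percent_publications_with_reuse(mentions):
    """Percentage of publications with at least one reuse mention."""
    pubs = {m["pub"] for m in mentions}
    reusing = {m["pub"] for m in mentions if m["reused"]}
    return 100.0 * len(reusing) / len(pubs)

print(count_reused_datasets(mentions))            # 2
print(percent_publications_with_reuse(mentions))  # 75.0
```

+Both quantities depend entirely on the quality of the upstream reuse classification, which is why the limitations of text-based detection noted in the academic indicator carry over to this metric.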