
FPSecretBench: A Dataset of False Positive Software Secrets Reported by Nine Secret Detection Tools


This work was accepted to the Technical Track of the International Symposium on Empirical Software Engineering and Measurement (ESEM 2023). The accepted paper is titled "A Comparative Study of Software Secrets Reporting by Secret Detection Tools"; a preprint is listed as arXiv:2307.00714 in the citation at the end of this README.

Introduction

According to GitGuardian's monitoring of public GitHub repositories, secrets sprawl accelerated by 67% in 2022 compared to 2021, exposing over 10 million secrets (API keys and other credentials). Although many open-source and proprietary secret detection tools are available, they report many false positives, making it difficult for developers to act on the findings and for teams to choose among the tools. To our knowledge, these tools have not yet been compared and evaluated. The goal of our study is to aid developers in choosing a secret detection tool to reduce the exposure of secrets, through an empirical investigation of existing secret detection tools. We present an evaluation of five open-source and four proprietary tools against a benchmark dataset.

In addition to comparing nine secret detection tools, we have curated a dataset, FPSecretBench, which contains the false positives reported by the secret detection tools when run on the benchmark dataset (SecretBench). The dataset is intended to expedite research on improving the accuracy of these tools.

How to Use

The dataset is stored in Google BigQuery. To access it, first create a Google Cloud account (Google Cloud offers $300 in free credit to new accounts). You can then run SQL queries in BigQuery to retrieve the false positive secrets.

  • Google BigQuery Dataset id (dev-range-332204.fpsecretbench): The BigQuery dataset contains the false positive secrets along with metadata such as repository name, commit id, file path, and the start line where the false positive secret was reported. More details on the metadata are given in the Data Overview section.

Important: Researchers and developers who want to use our dataset must contact us. Since the dataset may contain mislabeled true positives, a data protection agreement has to be signed with us to prevent any unethical use of the data. Once the agreement is signed, we will grant access to the dataset through the requester's email address.
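
Once access has been granted, querying the dataset from Python could look roughly like the sketch below. It assumes the `google-cloud-bigquery` client library is installed and that Google Cloud credentials are configured locally. The dataset id comes from this README, but the table names inside the dataset are not documented here, so `example_table` is a hypothetical placeholder; `list_table_names()` is one way to discover the real names first. The query uses `SELECT *` because the exact column names are also not documented here.

```python
import re

# Dataset id as given in this README.
DATASET_ID = "dev-range-332204.fpsecretbench"


def build_query(table: str, limit: int = 10) -> str:
    """Build a SQL query that previews rows from one table of the dataset.

    `table` is a placeholder supplied by the caller; this README does not
    list the actual table names.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", table):
        raise ValueError(f"unexpected table name: {table!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM `{DATASET_ID}.{table}` LIMIT {limit}"


def list_table_names() -> list:
    """Return the table names in the dataset (requires granted access)."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()  # uses your default credentials and project
    return [t.table_id for t in client.list_tables(DATASET_ID)]


def fetch_rows(table: str, limit: int = 10) -> list:
    """Run the preview query and return each row as a plain dict."""
    from google.cloud import bigquery

    client = bigquery.Client()
    rows = client.query(build_query(table, limit)).result()
    return [dict(row.items()) for row in rows]


if __name__ == "__main__":
    # "example_table" is hypothetical; replace it with a name from list_table_names().
    print(build_query("example_table", limit=5))
```

Queries are billed against your own Google Cloud project, so keeping a `LIMIT` clause while exploring is a sensible default.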

Data Overview

The dataset consists of the false positive secrets reported by the nine secret detection tools listed below when run on the SecretBench dataset. SecretBench is a benchmark dataset consisting of 818 public GitHub repositories.

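| Tool Name       | Number of False Positive Secrets |
|-----------------|----------------------------------|
| git-secrets     | 89,584                           |
| Gitleaks        | 24,885                           |
| Repo-Supervisor | 177,658                          |
| TruffleHog      | 85,556                           |
| Whispers        | 414,068                          |
| ggshield        | 134,769                          |
| SpectralOps     | 1,543,217                        |
| GitHub-scanner  | 429                              |
| Commercial X    | 64,933                           |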
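
To put the per-tool counts in proportion, the short sketch below sums them and prints each tool's share of the total. The counts are copied from the table above; the variable and function names are illustrative only. Note that the total is a sum of per-tool reports: the same false positive may be reported by more than one tool, so it should not be read as a count of unique secrets.

```python
# False positive counts per tool, copied from the Data Overview table.
FALSE_POSITIVES = {
    "git-secrets": 89_584,
    "Gitleaks": 24_885,
    "Repo-Supervisor": 177_658,
    "TruffleHog": 85_556,
    "Whispers": 414_068,
    "ggshield": 134_769,
    "SpectralOps": 1_543_217,
    "GitHub-scanner": 429,
    "Commercial X": 64_933,
}


def shares(counts: dict) -> dict:
    """Return each tool's fraction of the summed per-tool reports."""
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()}


total_reports = sum(FALSE_POSITIVES.values())
tool_shares = shares(FALSE_POSITIVES)

if __name__ == "__main__":
    print(f"Total reports across tools: {total_reports:,}")
    # List tools from most to fewest reported false positives.
    for tool, n in sorted(FALSE_POSITIVES.items(), key=lambda kv: -kv[1]):
        print(f"{tool:<16} {n:>10,}  {tool_shares[tool]:6.2%}")
```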

License:

This project is licensed under the terms of the MIT license. Please check LICENSE for more details.

Ethics:

Since our dataset may contain sensitive information, we will make the dataset available only to researchers and tool developers, who must sign an agreement to protect the data from any unethical use.

How to Contribute

Please email us if you want to contribute. See the Authors section for contact information.

Authors:

Cite our work:

@misc{basak2023comparative,
      title={A Comparative Study of Software Secrets Reporting by Secret Detection Tools}, 
      author={Setu Kumar Basak and Jamison Cox and Bradley Reaves and Laurie Williams},
      year={2023},
      eprint={2307.00714},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}
