FPSecretBench: A Dataset of False Positive Software Secrets Reported by Nine Secret Detection Tools

This research has been accepted to the Technical Track of the International Symposium on Empirical Software Engineering and Measurement (ESEM 2023). The accepted paper, titled "A Comparative Study of Software Secrets Reporting by Secret Detection Tools", can be found here.

Introduction

According to GitGuardian’s monitoring of public GitHub repositories, secrets sprawl continued to accelerate in 2022, rising 67% compared to 2021 and exposing over 10 million secrets (API keys and other credentials). Although many open-source and proprietary secret detection tools are available, they report many false positives, which makes it difficult for developers to act on the findings and for teams to choose among the tools. To our knowledge, secret detection tools have not yet been compared and evaluated. The goal of our study is to aid developers in choosing a secret detection tool to reduce the exposure of secrets through an empirical investigation of existing secret detection tools. We present an evaluation of five open-source and four proprietary tools against a benchmark dataset.

In addition to comparing nine secret detection tools, we have curated a dataset, FPSecretBench, which contains the false positives reported by the secret detection tools when run on the benchmark dataset (SecretBench). The dataset will help expedite research on improving the accuracy of these tools.

How to Use

The dataset is stored in Google BigQuery. To access it, first create a Google Cloud account; Google Cloud gives $300 in free credit when you open an account. You can then run SQL queries in Google BigQuery to access the false-positive secrets.

Google BigQuery Dataset id (dev-range-332204.fpsecretbench): The BigQuery dataset contains the false-positive secrets along with metadata such as the repository name, commit id, file path, and start line where each false-positive secret is reported. The metadata is described in more detail in the Data Overview section.

Important: Researchers and developers who want to use our dataset must contact us. Since the dataset may contain mislabeled true positives, a data protection agreement must be signed with us to prevent any unethical use of the data. After that, we will grant access to the dataset through the email addresses provided.
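Once access has been granted, querying the dataset from Python might look like the minimal sketch below. It assumes the `google-cloud-bigquery` client library is installed and that you have authenticated with Google Cloud. The dataset id comes from this README; the table name `false_positives` is a hypothetical placeholder, so the sketch first lists the tables that actually exist.

```python
"""Sketch: preview rows from the FPSecretBench BigQuery dataset."""

DATASET = "dev-range-332204.fpsecretbench"  # dataset id given in this README
TABLE_NAME = "false_positives"  # HYPOTHETICAL: replace with a real table name


def build_preview_query(table_name: str, limit: int = 10) -> str:
    """Return a SQL query that previews `limit` rows of one table."""
    # Reject anything that is not a plain identifier before interpolating it.
    if not table_name.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table_name!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM `{DATASET}.{table_name}` LIMIT {int(limit)}"


def main() -> None:
    # Imported here so the query builder works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default credentials and project

    # List the tables that actually exist, since TABLE_NAME is only a guess.
    for table in client.list_tables(DATASET):
        print("table:", table.table_id)

    # Run the preview query and print each row as a plain dictionary.
    for row in client.query(build_preview_query(TABLE_NAME)).result():
        print(dict(row.items()))


# Call main() once you have been granted access and have authenticated.
```

Queries are billed to your own Google Cloud project, so keeping a `LIMIT` on exploratory queries is a reasonable default.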

Data Overview

The dataset consists of the false-positive secrets reported by the nine secret detection tools listed below when run on the SecretBench dataset. SecretBench is a benchmark dataset consisting of 818 public GitHub repositories.

Tool Name	# False Positive Secrets
git-secrets	89,584
Gitleaks	24,885
Repo-Supervisor	177,658
TruffleHog	85,556
Whispers	414,068
ggshield	134,769
SpectralOps	1,543,217
GitHub-scanner	429
Commercial X	64,933

License:

This project is licensed under the terms of the MIT license. Please check LICENSE for more details.

Ethics:

Since our dataset may contain sensitive information, we will make the dataset available only to researchers and tool developers, who must sign an agreement to protect the data from any unethical use.

How to Contribute

Please email us if you want to contribute. See the Authors section for contact information.

Authors:

Setu Kumar Basak (sbasak4@ncsu.edu)
Jamison Cox (jcox3@ncsu.edu)
Bradley Reaves (bgreaves@ncsu.edu)
Laurie Williams (lawilli3@ncsu.edu)

Cite our work:

@misc{basak2023comparative,
      title={A Comparative Study of Software Secrets Reporting by Secret Detection Tools}, 
      author={Setu Kumar Basak and Jamison Cox and Bradley Reaves and Laurie Williams},
      year={2023},
      eprint={2307.00714},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}
