feat: Sensitive Data Detection in files like (.csv , .xlsx , json) #761

Closed
Psingle20 wants to merge 11 commits into
finos:mainfrom
Psingle20:CheckFilesData

Conversation

@Psingle20 Psingle20 commented Oct 26, 2024


This PR introduces the checkSensitiveData feature, which improves security by scanning files such as .csv for vulnerabilities and sensitive information.
The implementation includes:

Functionality:

  • Created a push action, CheckSensitiveData, which takes the diff and scans the changed files for sensitive information.
  • Integrated it into the push-action chain.
  • Implemented a test file for the push action and modified the chain test file to make sure it works with the new feature.

I think this functionality solves issue #745.
You can run the custom test with the command npx mocha test/SensitiveData.test.js
Edit proxy.config.json and add the file extension to the ProxyFileTypes array, e.g. ".csv".
Also, please run the test/CreateExcel.js file to create test data for testing.

@JamieSlome Please review this PR and suggest any changes necessary

Citi Hackathon
Team Members
Prachit Ingle Psingle20
Shabbir Kaderi shabbirflow
Chaitanya Deshmukh ChaitanyaD48

linux-foundation-easycla Bot commented Oct 26, 2024


CLA Signed

The committers listed above are authorized under a signed CLA.

netlify Bot commented Oct 26, 2024


Deploy Preview for endearing-brigadeiros-63f9d0 canceled.

🔨 Latest commit: 868c074
🔍 Latest deploy log: https://app.netlify.com/sites/endearing-brigadeiros-63f9d0/deploys/67513cacebea8d0008a644e4

@Psingle20 Psingle20 changed the title Feat: Sensitive Data Detection in files like (.csv , .xlsx , json) feat: Sensitive Data Detection in files like (.csv , .xlsx , json) Oct 26, 2024
Comment thread .husky/commit-msg

@laukik-target laukik-target left a comment

Contributor

The code is well-structured and handles different file types effectively.
A few improvements are recommended:

  • Test Coverage: Add tests for no sensitive data, empty files, and file-not-found scenarios.
  • Optimization: Consider streaming large files for better memory management.
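
To illustrate the two recommendations above, here is a minimal sketch of a line-by-line scan built on Node's readline module, so a large plaintext file is never held in memory all at once. The helper name scanFileByLine is hypothetical and not part of the PR, and treating a missing file as "nothing found" is an assumption made for this sketch. The patterns are copied from the PR. This approach only suits plaintext formats such as .csv; .xlsx files would still need a spreadsheet parser.

```javascript
const fs = require('node:fs');
const readline = require('node:readline');

// Patterns copied from the PR under review; they are known to be noisy.
const sensitivePatterns = [
  /\d{3}-\d{2}-\d{4}/, // Social Security Number (SSN)
  /\b\d{16}\b/, // Credit card numbers
  /\b\d{5}-\d{4}\b/, // ZIP+4 codes
];

// Hypothetical helper (not from the PR): resolves to true as soon as one line
// matches a pattern, and to false for clean, empty, or missing files.
async function scanFileByLine(filePath) {
  // Assumption: a file that no longer exists (e.g. deleted in the push)
  // cannot leak data, so it is reported as clean.
  if (!fs.existsSync(filePath)) {
    return false;
  }

  const stream = fs.createReadStream(filePath, { encoding: 'utf8' });
  const lines = readline.createInterface({ input: stream, crlfDelay: Infinity });

  try {
    for await (const line of lines) {
      if (sensitivePatterns.some((pattern) => pattern.test(line))) {
        return true; // stop reading at the first hit
      }
    }
    return false;
  } finally {
    // Release the file handle even when returning early.
    lines.close();
    stream.destroy();
  }
}
```

The three scenarios named in the test-coverage bullet (no sensitive data, empty file, file not found) each map to a single call of this helper with a temporary file path, which keeps such tests short.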

Comment thread src/proxy/processors/push-action/checkSensitiveData.js
Comment thread test/CheckSensitive.test.js
@Psingle20

Author

@coopernetes @JamieSlome Could you please review this PR and share your thoughts?

Comment on lines +9 to +14
const sensitivePatterns = [
  /\d{3}-\d{2}-\d{4}/, // Social Security Number (SSN)
  /\b\d{16}\b/, // Credit card numbers
  /\b\d{5}-\d{4}\b/, // ZIP+4 codes
  // Add more patterns as needed
];

@rgmz rgmz Oct 30, 2024

Contributor

The intent behind this change is good, though it must be noted these will produce a large number of false positives.

Ideally this wouldn't block (only warn), or would have an easy way to exclude false positives.
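
One possible way to act on this suggestion is sketched below. It is an illustration, not the PR's implementation: the rule shape, the findSensitive function, and its allowlist and mode options are all hypothetical names. It shows three ideas: a Luhn checksum to discard 16-digit runs that cannot be card numbers, an allowlist for known false positives, and a warn mode that reports findings without blocking. The SSN pattern here adds word boundaries, which the PR's version does not have.

```javascript
// Luhn checksum: valid card numbers pass it, most arbitrary 16-digit runs do not.
function passesLuhn(digits) {
  let sum = 0;
  let doubleNext = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = Number(digits[i]);
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

// Hypothetical rule shape: a global pattern plus an optional validator.
const rules = [
  { name: 'ssn', pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { name: 'card', pattern: /\b\d{16}\b/g, validate: passesLuhn },
];

// Hypothetical scanner: `allowlist` holds exact values to ignore, and
// `mode` is 'warn' (report only) or 'block' (report and block).
function findSensitive(text, { allowlist = [], mode = 'warn' } = {}) {
  const findings = [];
  for (const { name, pattern, validate } of rules) {
    for (const match of text.matchAll(pattern)) {
      const value = match[0];
      if (allowlist.includes(value)) continue; // explicitly excluded
      if (validate && !validate(value)) continue; // fails the extra check
      findings.push({ rule: name, value });
    }
  }
  return { findings, blocked: mode === 'block' && findings.length > 0 };
}
```

With this shape, a caller could log the findings in warn mode and only set the step's blocked flag when the result says so, which keeps the default behavior non-disruptive.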

Author

Great point, @rgmz! I will think about this.

Contributor

I agree. Not to mention, this does not cover all geographies.

I'm inclined to merge it, as it is not configured by default. Granted, a more holistic approach with better heuristics is worth investing in for the GitProxy project, but this is a good enough start.

const simpleGit = require('simple-git')



Contributor

Mistaken change?

Suggested change



exec.displayName = 'logFileChanges.exec';
exports.exec = exec;
\ No newline at end of file

Contributor

Suggested change
exports.exec = exec;
exports.exec = exec;

exports.checkAuthorEmails = require('./checkAuthorEmails').exec;
exports.checkUserPushPermission = require('./checkUserPushPermission').exec;
exports.clearBareClone = require('./clearBareClone').exec;
exports.checkSensitiveData = require('./checkSensitiveData').exec;
\ No newline at end of file

Contributor

Add missing newline at the end of the file.

Suggested change
exports.checkSensitiveData = require('./checkSensitiveData').exec;
exports.checkSensitiveData = require('./checkSensitiveData').exec;

Comment on lines +137 to +166
const exec = async (req, action) => {
  const diffStep = action.steps.find((s) => s.stepName === 'diff');
  const step = new Step('checksensitiveData');

  if (diffStep && diffStep.content) {
    console.log('Diff content:', diffStep.content);

    // Use the parsing function to get file paths
    const filePaths = extractFilePathsFromDiff(diffStep.content);

    if (filePaths.length > 0) {
      // Check for sensitive data in all files
      const sensitiveDataFound = await Promise.all(filePaths.map(parseFile));
      const anySensitiveDataDetected = sensitiveDataFound.some(found => found);

      if (anySensitiveDataDetected) {
        step.blocked = true;
        step.error = true;
        step.errorMessage = 'Your push has been blocked due to sensitive data detection.';
        console.log(step.errorMessage);
      }
    } else {
      console.log('No file paths provided in the diff step.');
    }
  } else {
    console.log('No diff content available.');
  }
  action.addStep(step);
  return action; // Returning action for testing purposes
};

Contributor

@Psingle20 since #793 has been released in 1.6.0, can we restructure this functionality into its own plugin? It'll require moving some files around and creating an npm package using npm init.

The other change will be this:

const Step = require('@finos/git-proxy/src/proxy/actions').Step;
const config = require('@finos/git-proxy/src/config');

Use plugins/git-proxy-sample-plugins and refer to the docs (to be improved via #811) for details.
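
As a rough sketch of what the restructured action might look like once it lives outside the proxy source tree, the factory below takes its dependencies as arguments instead of importing them from relative paths. Everything here is hypothetical: makeCheckSensitiveData, scanFile, and extractPaths are illustrative names, and the sketch deliberately does not show how the plugin is registered with GitProxy, which the sample plugins and docs cover. In a real plugin, Step would presumably come from the require line quoted above.

```javascript
// Hypothetical factory: dependencies are injected so the action does not rely
// on relative imports into the proxy's source tree.
//   Step         - step class (in a real plugin, from the git-proxy package)
//   scanFile     - async (path) => boolean, true if sensitive data is found
//   extractPaths - (diffContent) => string[] of changed file paths
function makeCheckSensitiveData({ Step, scanFile, extractPaths }) {
  const exec = async (req, action) => {
    const step = new Step('checkSensitiveData');
    const diffStep = action.steps.find((s) => s.stepName === 'diff');

    // No diff content means there is nothing to scan.
    const paths = diffStep && diffStep.content ? extractPaths(diffStep.content) : [];
    const results = await Promise.all(paths.map(scanFile));

    if (results.some(Boolean)) {
      step.blocked = true;
      step.error = true;
      step.errorMessage = 'Your push has been blocked due to sensitive data detection.';
    }

    action.addStep(step);
    return action;
  };
  exec.displayName = 'checkSensitiveData.exec';
  return exec;
}
```

Because the function receives a Step class and a scanner rather than requiring them, it can be exercised with small fakes and no changes to the shared chain test.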

Author

I will do that. By the way, we have added some refactoring in #810, which also contains some more features. Should I open a separate PR for each plugin?
Along with this, we have added gitleaks support, an EXIF metadata check, and an AI/ML usage check.

Comment thread test/CreateExcel.js
fs.mkdirSync(testDataPath, { recursive: true }); // Using recursive to ensure all directories are created
}
// Write the Excel file to the test_data directory
XLSX.writeFile(workbook, path.join(testDataPath, 'sensitive_data2.xlsx'));
\ No newline at end of file

Contributor

Add missing newline at the end of the file.

Suggested change
XLSX.writeFile(workbook, path.join(testDataPath, 'sensitive_data2.xlsx'));
XLSX.writeFile(workbook, path.join(testDataPath, 'sensitive_data2.xlsx'));

Comment thread test/chain.test.js
pullRemote: sinon.stub(),
writePack: sinon.stub(),
getDiff: sinon.stub(),
checkSensitiveData : sinon.stub(),

Contributor

Moving this new functionality into its own plugin will keep the proxy chain easier to test. Constantly adding new functions here will not be maintainable in the long term. This test itself is also not well structured or easy to maintain, so we want to add to it as little as possible. Plugins are preferred.

Comment thread proxy.config.json Outdated
Comment thread .husky/commit-msg Outdated
Comment thread .gitignore Outdated
Psingle20 and others added 3 commits December 1, 2024 12:10
Co-authored-by: Thomas Cooper <coopernetes@proton.me>
Co-authored-by: Thomas Cooper <coopernetes@proton.me>
Co-authored-by: Thomas Cooper <coopernetes@proton.me>
@06kellyjac 06kellyjac added the citi-hackathon Related to the Citi India Hackathon (Oct '24) label Sep 25, 2025

@jescalada jescalada left a comment

Contributor

@Psingle20 Thanks for the contribution!

These changes seem more appropriate for a plugin rather than a push action, as many organizations don't need to scan those specific file types for those specific patterns. The commitConfig.diff config property along with scanDiff already allows scanning .csv and .json files which are stored in plaintext.
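
To make the point about plaintext files concrete, the sketch below shows the general idea of scanning a diff directly: collect the added lines per file from a unified diff and test them against patterns. This is a conceptual illustration only and does not reproduce GitProxy's scanDiff or the commitConfig.diff schema; the function names addedLinesByFile and scanAddedLines are hypothetical.

```javascript
// Hypothetical helper: collect added lines per file from a unified diff.
// Limitation: an added line whose content begins with "++ " looks like a
// "+++ " header and would be misread; a full parser would track hunk sizes.
function addedLinesByFile(diff) {
  const result = new Map();
  let current = null;
  for (const line of diff.split('\n')) {
    if (line.startsWith('+++ ')) {
      const target = line.slice(4).trim();
      // "+++ /dev/null" marks a deleted file, which has no added lines.
      current = target === '/dev/null' ? null : target.replace(/^b\//, '');
      if (current !== null && !result.has(current)) result.set(current, []);
    } else if (current !== null && line.startsWith('+')) {
      result.get(current).push(line.slice(1));
    }
  }
  return result;
}

// Hypothetical scanner: report added lines in files with the given extensions
// that match any of the given patterns.
function scanAddedLines(diff, patterns, extensions = ['.csv', '.json']) {
  const hits = [];
  for (const [file, lines] of addedLinesByFile(diff)) {
    if (!extensions.some((ext) => file.endsWith(ext))) continue;
    for (const line of lines) {
      if (patterns.some((pattern) => pattern.test(line))) {
        hits.push({ file, line });
      }
    }
  }
  return hits;
}
```

Since .csv and .json content appears verbatim in the diff, no file needs to be opened or parsed for these types; binary formats such as .xlsx are the case where a dedicated checker still adds something.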

It'd be great to adapt the Excel checker into its own plugin, so it can be used by users that really need it instead of slowing down the default push chain. Check out the plugin guide for details on how to do this.

I'll close this PR to keep our backlog clean. Feel free to open a new PR with the requested changes 🙂

@jescalada jescalada closed this Feb 1, 2026

Labels

citi-hackathon Related to the Citi India Hackathon (Oct '24) feature

7 participants