feat: Sensitive Data Detection in files like (.csv , .xlsx , json) #761
Psingle20 wants to merge 11 commits into
Conversation
✅ Deploy Preview for endearing-brigadeiros-63f9d0 canceled.
laukik-target left a comment:
The code is well-structured and handles different file types effectively.
A few improvements are recommended:
- Test Coverage: Add tests for no sensitive data, empty files, and file-not-found scenarios.
- Optimization: Consider streaming large files for better memory management.
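As a rough illustration of the streaming suggestion, the sketch below scans a plaintext file line by line instead of loading it whole. The helper name `scanFileByLine` is hypothetical and not part of the PR; the patterns are copied from the diff under review. It assumes a plaintext format such as .csv; .xlsx files would need a different reader.

```javascript
const fs = require('fs');
const readline = require('readline');

// Patterns copied from the PR under review.
const sensitivePatterns = [
  /\d{3}-\d{2}-\d{4}/, // Social Security Number (SSN)
  /\b\d{16}\b/, // Credit card numbers
  /\b\d{5}-\d{4}\b/, // ZIP+4 codes
];

// Hypothetical helper (not from the PR): resolves to true as soon as any
// line matches a pattern, and to false if the file has no matching line.
async function scanFileByLine(filePath, patterns = sensitivePatterns) {
  // Rejects early (e.g. ENOENT) when the file is missing or unreadable.
  await fs.promises.access(filePath, fs.constants.R_OK);

  const input = fs.createReadStream(filePath, { encoding: 'utf8' });
  const lines = readline.createInterface({ input, crlfDelay: Infinity });
  try {
    for await (const line of lines) {
      if (patterns.some((pattern) => pattern.test(line))) {
        return true; // Stop reading at the first hit.
      }
    }
    return false;
  } finally {
    // Release the file descriptor even when returning early.
    input.destroy();
  }
}
```

This shape also lends itself to the suggested test cases: the helper resolves to `false` for an empty file or a file without matches, and rejects when the file does not exist.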
Signed-off-by: ChaitanyaD48 <chaitanya.d48@gmail.com>
@coopernetes @JamieSlome Could you please review this PR and share your thoughts?
    const sensitivePatterns = [
      /\d{3}-\d{2}-\d{4}/, // Social Security Number (SSN)
      /\b\d{16}\b/, // Credit card numbers
      /\b\d{5}-\d{4}\b/, // ZIP+4 codes
      // Add more patterns as needed
    ];

The intent behind this change is good, though it must be noted these will produce a large number of false positives.
Ideally this wouldn't block (only warn), or would have an easy way to exclude false positives.
I agree. Not to mention, this does not cover all geographies.
I'm inclined to merge it as it is not configured by default. Granted, a more holistic approach with better heuristics is worth investing in for the GitProxy project, but this is a good enough start.
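To make the false-positive concern concrete, here is a minimal sketch of one possible heuristic: only report 16-digit numbers that pass the Luhn checksum, skip values on an exclusion list, and return findings so the caller can decide whether to warn or block. The names `passesLuhn`, `findSensitiveData`, and the `allowlist` option are assumptions made for illustration; none of them come from the PR or from GitProxy.

```javascript
// Luhn checksum: true when the digit string passes the card-number check.
function passesLuhn(digits) {
  let sum = 0;
  for (let i = 0; i < digits.length; i += 1) {
    // Walk from the rightmost digit, doubling every second one.
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// Hypothetical finder: returns a list of findings instead of a boolean,
// leaving the warn-versus-block decision to the caller.
function findSensitiveData(text, { allowlist = [] } = {}) {
  const rules = [
    { type: 'ssn', regex: /\b\d{3}-\d{2}-\d{4}\b/g },
    { type: 'card', regex: /\b\d{16}\b/g, validate: passesLuhn },
  ];
  const findings = [];
  for (const { type, regex, validate } of rules) {
    for (const match of text.matchAll(regex)) {
      const value = match[0];
      if (allowlist.includes(value)) continue; // Known false positive.
      if (validate && !validate(value)) continue; // Fails the checksum.
      findings.push({ type, value });
    }
  }
  return findings;
}
```

With this sketch, an arbitrary 16-digit identifier such as `1234567812345678` is ignored because it fails the checksum, while the well-known test number `4111111111111111` is still reported. It does nothing for the geography concern, which would need additional region-specific rules.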
    const simpleGit = require('simple-git')

Mistaken change?
    exec.displayName = 'logFileChanges.exec';
    exports.exec = exec;

\ No newline at end of file
Suggested change:

    exports.exec = exec;

    exports.checkAuthorEmails = require('./checkAuthorEmails').exec;
    exports.checkUserPushPermission = require('./checkUserPushPermission').exec;
    exports.clearBareClone = require('./clearBareClone').exec;
    exports.checkSensitiveData = require('./checkSensitiveData').exec;

\ No newline at end of file
Add missing newline at the end of the file.
Suggested change:

    exports.checkSensitiveData = require('./checkSensitiveData').exec;

    const exec = async (req, action) => {
      const diffStep = action.steps.find((s) => s.stepName === 'diff');
      const step = new Step('checksensitiveData');

      if (diffStep && diffStep.content) {
        console.log('Diff content:', diffStep.content);

        // Use the parsing function to get file paths
        const filePaths = extractFilePathsFromDiff(diffStep.content);

        if (filePaths.length > 0) {
          // Check for sensitive data in all files
          const sensitiveDataFound = await Promise.all(filePaths.map(parseFile));
          const anySensitiveDataDetected = sensitiveDataFound.some(found => found);

          if (anySensitiveDataDetected) {
            step.blocked = true;
            step.error = true;
            step.errorMessage = 'Your push has been blocked due to sensitive data detection.';
            console.log(step.errorMessage);
          }
        } else {
          console.log('No file paths provided in the diff step.');
        }
      } else {
        console.log('No diff content available.');
      }
      action.addStep(step);
      return action; // Returning action for testing purposes
    };

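The PR's `extractFilePathsFromDiff` helper is not shown in this thread. Purely as an assumption about what such a step might do, the sketch below collects the paths of added or modified files from a unified diff by reading its `+++ b/` header lines. The function name `extractPathsFromUnifiedDiff` is hypothetical and this is not the PR's implementation.

```javascript
// Hypothetical stand-in for the PR's diff-parsing helper.
// Collects unique paths from "+++ b/<path>" header lines in a unified diff.
function extractPathsFromUnifiedDiff(diffContent) {
  const paths = new Set();
  for (const line of diffContent.split('\n')) {
    // Deleted files use "+++ /dev/null" and therefore do not match.
    const match = /^\+\+\+ b\/(.+)$/.exec(line);
    if (match) {
      paths.add(match[1].trim());
    }
  }
  return [...paths];
}
```

A line-based scan like this is deliberately simple: it does not track hunk boundaries, so an added content line that happens to begin with `++ b/` would be misread as a header. A stricter parser would only accept header lines that follow a `diff --git` line.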
@Psingle20 since #793 has been released in 1.6.0, can we restructure this functionality into its own plugin? It'll require moving some files around and creating an npm package using npm init.
The other change will be this:
const Step = require('@finos/git-proxy/src/proxy/actions').Step;
const config = require('@finos/git-proxy/src/config');
Use plugins/git-proxy-sample-plugins and refer to the docs (to be improved via #811) for details.
I will do that. By the way, we have added some refactoring in #810, which also contains some more features. Should I open an individual PR for each plugin?
Along with this, we have added gitleaks support, an EXIF metadata check, and an AIML usage check.
      fs.mkdirSync(testDataPath, { recursive: true }); // Using recursive to ensure all directories are created
    }
    // Write the Excel file to the test_data directory
    XLSX.writeFile(workbook, path.join(testDataPath, 'sensitive_data2.xlsx'));

\ No newline at end of file
Add missing newline at the end of the file.
Suggested change:

    XLSX.writeFile(workbook, path.join(testDataPath, 'sensitive_data2.xlsx'));

    pullRemote: sinon.stub(),
    writePack: sinon.stub(),
    getDiff: sinon.stub(),
    checkSensitiveData : sinon.stub(),

Moving this new functionality into its own plugin will keep the proxy chain easier to test. Having to constantly add new functions here will not be maintainable long term. This test itself is also not especially well structured or easy to maintain, so we want to avoid adding to it as much as possible. Plugins are preferred.
Co-authored-by: Thomas Cooper <coopernetes@proton.me>
@Psingle20 Thanks for the contribution!
These changes seem more appropriate for a plugin rather than a push action, as many organizations don't need to scan those specific file types for those specific patterns. The commitConfig.diff config property along with scanDiff already allows scanning .csv and .json files which are stored in plaintext.
It'd be great to adapt the Excel checker into its own plugin, so it can be used by users that really need it instead of slowing down the default push chain. Check out the plugin guide for details on how to do this.
I'll close this PR to keep our backlog clean. Feel free to open a new PR with the requested changes 🙂
This PR introduces the checkSensitiveData feature, which enhances security by scanning files such as .csv for vulnerabilities and sensitive information.
The implementation includes:
Functionality:
I think this functionality solves issue #745.
You can run the custom tests with the command npx mocha test/SensitiveData.test.js
Edit Proxy.config.json and add the file extension to the ProxyFileTypes array, e.g. ".csv".
Also, please run the test/CreateExcel.js file to create test data for testing.
@JamieSlome Please review this PR and suggest any necessary changes.
Citi Hackathon
Team Members
Prachit Ingle Psingle20
Shabbir Kaderi shabbirflow
Chaitanya Deshmukh ChaitanyaD48