Professional file sanitization tool for privacy-conscious users
- Overview
- Features
- Installation
- Usage
- Command Reference
- Examples
- What Gets Cleaned
- Project Structure
- Reports & Verification
- Troubleshooting
- Contributing
- License
- Author
Metadata Cleaner is a comprehensive file sanitization tool that removes all identifying metadata from files and directories before sharing them on any platform. Protect your privacy when uploading to cloud storage, sharing files, publishing datasets, or collaborating online.
- Universal Privacy Protection: Works with any file sharing platform or cloud storage
- Comprehensive Cleaning: Removes EXIF, ID3, document metadata, execution history, and more
- Post-Sanitization Audit: Automatic verification ensures no sensitive data remains
- Production Ready: Robust error handling, integrity checks, and detailed reporting
- Cross-Platform: Works on Linux and macOS with portable shell scripting
🎯 Comprehensive Metadata Removal
- EXIF data from images (GPS, camera info, timestamps)
- ID3 tags from audio files (artist, album, year)
- Document metadata (author, software, creation dates)
- Jupyter notebook execution history and outputs
- Filesystem extended attributes
- Author/date comments in text files
🔒 Aggressive Sensitive Data Protection
- Detects and removes passwords, secrets, API keys, tokens
- Sanitizes email addresses → [EMAIL_REMOVED]
- Neutralizes absolute paths: /home/username/ → /home/user/
- Masks private IPs: 192.168.x.x → XXX.XXX.XXX.XXX
- Cleans version control metadata (Git)
🔍 Automated Post-Sanitization Audit
- Verifies no sensitive data remains after cleaning
- Scans for patterns, emails, paths, and private IPs
- Generates detailed warnings with file locations
- Confirms files are safe for public sharing
💾 Smart Backup & Verification
- Optional compressed backups (tar.gz)
- SHA256 checksum generation for integrity verification
- Timestamped backups enable multiple restore points
- File integrity checks (size, format validation, JSON parsing)
⚡ Performance & Reliability
- Parallel processing support for large directories
- Cross-platform compatibility (Linux and macOS)
- Dry-run mode to preview changes safely
- Detailed reporting and statistics
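The backup and checksum features listed above can be approximated in a few lines of shell. This is a minimal sketch, not the script's actual implementation: the function name `backup_with_checksums`, the archive naming scheme, and the `.sha256` sidecar file are assumptions made for illustration. Only the default directory `./metadata_backup` comes from the command reference.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a timestamped tar.gz backup of a file or
# directory and write a SHA256 checksum next to the archive.
backup_with_checksums() {
  local target=$1
  local backup_dir=${2:-./metadata_backup}   # default taken from the docs
  local stamp archive
  stamp=$(date +%Y%m%d_%H%M%S)                # one restore point per run
  archive="$backup_dir/backup_$stamp.tar.gz"  # naming scheme is an assumption

  mkdir -p "$backup_dir" || return 1
  # -C keeps absolute path prefixes out of the archive
  tar -czf "$archive" -C "$(dirname "$target")" "$(basename "$target")" || return 1

  # sha256sum ships with GNU coreutils; macOS provides shasum instead
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$archive" > "$archive.sha256" || return 1
  else
    shasum -a 256 "$archive" > "$archive.sha256" || return 1
  fi
  printf '%s\n' "$archive"
}
```

Calling `backup_with_checksums ./photos ./backups` prints the path of the new archive. Keep the backup directory outside the target so the archive does not try to include itself.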
Required:
# Debian/Ubuntu
sudo apt install -y libimage-exiftool-perl python3
# Fedora/RHEL
sudo dnf install -y perl-Image-ExifTool python3
# macOS (Homebrew)
brew install exiftool python3
Optional (Recommended):
# mat2 provides enhanced cleaning for documents and archives
# Debian/Ubuntu
sudo apt install -y mat2
# macOS
pip3 install mat2
# Clone repository
git clone https://github.com/[username]/bash-scripts.git
cd bash-scripts
# Make executable
chmod +x metadata_cleaner.sh
# Verify installation
./metadata_cleaner.sh --help
# Optional: Install to system path
sudo cp metadata_cleaner.sh /usr/local/bin/metadata-cleaner
# Download script directly
wget https://raw.githubusercontent.com/[username]/bash-scripts/main/metadata_cleaner.sh
# Or using curl
curl -O https://raw.githubusercontent.com/[username]/bash-scripts/main/metadata_cleaner.sh
# Make executable
chmod +x metadata_cleaner.sh
# Clean a single file
./metadata_cleaner.sh document.pdf
# Clean directory recursively
./metadata_cleaner.sh -r /path/to/directory
# Preview changes (dry-run)
./metadata_cleaner.sh -d -r /path/to/directory
# Full sanitization with backup
./metadata_cleaner.sh -r -b -a /path/to/files
For detailed usage, run:
./metadata_cleaner.sh --help
-h, --help             Show help message
-v, --verbose          Enable verbose output
-d, --dry-run          Preview changes without modifying files
-r, --recursive        Process directories recursively
-b, --backup           Create backup before cleaning
--backup-dir <path>    Specify custom backup directory (default: ./metadata_backup)
-c, --compress         Compress backup
-a, --aggressive       Remove sensitive data patterns (aggressive mode)
--sanitize-git         Clean version control metadata (remove remotes, unset local user info)
--checksums            Generate SHA256 hashes (added to report when enabled)
--report <file>        Generate detailed report to specified file
-p, --parallel         Enable parallel processing (uses xargs -P)
-j, --jobs <n>         Number of parallel jobs (default: 4)
# Always preview first to see what will be changed
./metadata_cleaner.sh -d -r /path/to/share
# Create backup + aggressive cleaning + verification
./metadata_cleaner.sh -r -b -a /path/to/share
# Clean presentation with backup
./metadata_cleaner.sh -b -a presentation.pptx
# Preview first, then clean
./metadata_cleaner.sh -d -r photos/
./metadata_cleaner.sh -r -b -a photos/
# Comprehensive cleaning with full audit trail
./metadata_cleaner.sh -r -b \
  --backup-dir ./backups/dataset \
  -c --checksums \
  --report dataset_audit.txt \
  -a research_data/
# Maximum safety: backup + compress + aggressive + git sanitization + report
./metadata_cleaner.sh -r -b -c -a \
  --sanitize-git \
  --checksums \
  --report full_audit.txt \
  /path/to/project
# Use 8 parallel jobs for faster processing
./metadata_cleaner.sh -r -p -j 8 -b -a /large/directory
# Remove Git metadata and user information
./metadata_cleaner.sh -r -a --sanitize-git repo/
- Always test with -d (dry-run) first to preview changes
- Use -b (backup) before cleaning important files
- Combine -c with -b for compressed backups (saves space)
- Use -a (aggressive) to remove sensitive data patterns from file contents
- Quote paths with spaces: "./path with spaces/"
- Dry-run skips audit: The post-sanitization audit only runs in live mode
Formats: JPEG, PNG, TIFF, GIF, BMP, WebP
Removed Metadata:
- GPS coordinates and location data
- Camera make, model, and settings
- Software and editing information
- Creation and modification timestamps
- Author and copyright information
Formats: MP3, MP4, AVI, MOV, WAV
Removed Metadata:
- Artist, album, and track information
- Recording location and timestamps
- Encoding software details
- Comments and descriptions
Formats: PDF, DOCX, XLSX, PPTX, ODT
Removed Metadata:
- Author and organization information
- Creation and modification software
- Document statistics (edit time, revisions)
- Template information
- Hidden text and comments
Format: .ipynb
Removed Metadata:
- Cell execution counts
- Cell outputs and results
- Kernel information and session data
- Notebook-level metadata
- User-specific paths
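The notebook cleaning listed above amounts to rewriting the notebook's JSON. The sketch below is an illustration under stated assumptions, not the script's own code: `strip_notebook` is a hypothetical name, and it relies on `python3` (already a required dependency) because shell alone cannot edit JSON safely. It clears outputs and execution counts, then empties cell-level and notebook-level metadata.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip outputs, execution counts, and metadata
# from a .ipynb file in place, using python3 for the JSON handling.
strip_notebook() {
  local nb=$1
  python3 - "$nb" <<'PY'
import json
import sys

path = sys.argv[1]
with open(path, encoding="utf-8") as fh:
    nb = json.load(fh)

for cell in nb.get("cells", []):
    if cell.get("cell_type") == "code":
        cell["outputs"] = []            # drop results and tracebacks
        cell["execution_count"] = None  # drop the In[n] counters
    cell["metadata"] = {}               # drop per-cell metadata

nb["metadata"] = {}  # drops kernelspec, language_info, and similar keys

with open(path, "w", encoding="utf-8") as fh:
    json.dump(nb, fh, indent=1)
    fh.write("\n")
PY
}
```

Because the kernel information is removed, Jupyter will typically ask which kernel to use the next time the notebook is opened.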
Formats: Markdown, Python, SQL, CSV, JSON, YAML, Shell scripts
Removed Metadata:
- Author comments (# Author: ...)
- Date comments (# Date: ..., # Created: ...)
- Modification history comments
- Sensitive data patterns (aggressive mode)
Universal Cleaning:
- Filesystem timestamps → set to 2000-01-01
- Extended attributes (xattrs)
- File comments and annotations
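The universal cleaning steps above map onto standard utilities. The following is a rough sketch with a hypothetical function name (`neutralize_file_attrs`); the exact timestamp the script writes beyond the date 2000-01-01 is not documented, so midnight local time is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reset a file's timestamps and, on macOS, clear
# its extended attributes.
neutralize_file_attrs() {
  local f=$1
  # POSIX touch: -t takes [[CC]YY]MMDDhhmm and sets both access and
  # modification time. The inode change time (ctime) cannot be set.
  touch -t 200001010000 "$f" || return 1

  if [ "$(uname -s)" = "Darwin" ]; then
    xattr -c "$f"   # macOS: clear all extended attributes
  fi
  # On Linux, `setfattr -x <name> <file>` (attr package) removes one
  # named attribute; enumerating names is left out of this sketch.
}
```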
Standard mode removes all metadata from file headers and properties using exiftool and, when installed, mat2.
Aggressive mode (-a) additionally scans file contents and removes:
- Email addresses → [EMAIL_REMOVED]
- Absolute paths → /home/user/
- Private IP addresses → XXX.XXX.XXX.XXX
- Sensitive keywords: password, secret, api_key, token, credential
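To make the substitutions above concrete, here is a minimal `sed` sketch of the first three rules. The regular expressions and the function name `sanitize_stream` are assumptions chosen for illustration, and the script's real patterns may differ. Keyword handling (password, secret, and so on) is left out because the documentation does not say how matches are rewritten.

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads text on stdin, writes sanitized text to stdout.
sanitize_stream() {
  sed -E \
    -e 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/[EMAIL_REMOVED]/g' \
    -e 's#/home/[^/[:space:]]+/#/home/user/#g' \
    -e 's/(^|[^0-9.])(10\.[0-9]{1,3}|192\.168|172\.(1[6-9]|2[0-9]|3[01]))\.[0-9]{1,3}\.[0-9]{1,3}/\1XXX.XXX.XXX.XXX/g'
}

# Example:
#   printf 'bob@example.com /home/alice/x 192.168.1.20\n' | sanitize_stream
#   prints: [EMAIL_REMOVED] /home/user/x XXX.XXX.XXX.XXX
```

The third expression covers the three RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) and leaves public addresses such as 8.8.8.8 untouched.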
- Always preview first: Use -d (dry-run) to see what will be changed
- Create backups: Use -b before cleaning important files
- Verify results: Use exiftool or grep to manually verify critical files
- Use aggressive mode: Add -a for maximum privacy protection
- Review audit report: Check warnings for any remaining sensitive data
bash-scripts/
├── README.md # Main repository documentation
├── LICENSE # CC BY-NC-SA 4.0 license
├── CONTRIBUTING.md # Contribution guidelines
├── metadata_cleaner.sh # Main sanitization script
├── metadata_cleaner.md # This documentation
├── publish_repo.sh # Repository publishing script
├── publish_repo.md # Publishing documentation
└── life_game/ # Conway's Game of Life implementation
    ├── life_game.sh
    ├── install.sh
    ├── README.md
    ├── modules/
    └── data/
Generate comprehensive reports using the --report flag:
./metadata_cleaner.sh -r -a --report audit.txt /path/to/files
Summary Statistics:
- Total files processed
- Successfully cleaned files
- Skipped files (with reasons)
- Failed files (with errors)
- Sensitive data items removed
- Audit warnings and their severity
- Data volume processed
- Processing time
File-Level Details:
- SHA256 checksums (when --checksums enabled)
- Individual file processing status
- Specific issues detected per file
- Post-sanitization audit results
Audit Warnings:
- Remaining sensitive patterns with file locations
- Email addresses found
- Absolute paths detected
- Private IP addresses
========================================
Metadata Cleaner Report
========================================
Date: 2025-12-05 14:30:15
Target: /home/user/documents
Mode: LIVE
Recursive: true
Backup: true
Aggressive: true
========================================
=== FILE CHECKSUMS (SHA256) ===
SHA256: a3f5d8e... | /home/user/documents/file1.pdf
SHA256: b7c9e2f... | /home/user/documents/file2.docx
...
========================================
Final Statistics:
========================================
Total files found: 657
Successfully cleaned: 621
Skipped: 36 (excluded/binary)
Failed: 0
Sensitive data items: 39
Audit warnings: 0 (PASSED)
Data processed: ~100 MB
Time elapsed: 0m 3s
========================================
=== POST-SANITIZATION AUDIT ===
[AUDIT RESULT] PASSED - No sensitive data detected
# Generate and review audit report
./metadata_cleaner.sh -r -a --report audit.txt --checksums files/
# Check for remaining issues
grep "AUDIT WARNING" audit.txt
# Verify checksums
grep "SHA256:" audit.txt
# Manual verification with exiftool
exiftool cleaned_file.jpg
# Search for specific patterns
grep -r "password\|secret" cleaned_directory/
The post-sanitization audit automatically:
- Scans all cleaned files for sensitive patterns
- Checks for email addresses
- Detects absolute paths with usernames
- Identifies private IP addresses
- Verifies file integrity
- Generates warnings with specific file locations
- Provides PASS/FAIL status
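An audit pass of this kind is essentially a recursive pattern search over the cleaned tree. The sketch below shows the idea for email addresses and private IPs only; path and keyword checks would follow the same shape. The function name `audit_tree`, the regular expressions, and the `FAILED` wording are assumptions. Only the `AUDIT WARNING` prefix and the `PASSED` line appear in the sample report.

```shell
#!/usr/bin/env bash
# Hypothetical audit helper: prints one "AUDIT WARNING" line per hit and
# returns non-zero when anything suspicious remains.
audit_tree() {
  local dir=$1
  local email='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
  local private_ip='(^|[^0-9.])(10\.[0-9]{1,3}|192\.168|172\.(1[6-9]|2[0-9]|3[01]))(\.[0-9]{1,3}){2}'
  local hits

  # -r recurse, -n show line numbers, -I skip binary files, -E extended regex.
  # "|| true" keeps a no-match result from aborting callers that use set -e.
  hits=$(grep -rnIE -- "$email|$private_ip" "$dir" || true)

  if [ -n "$hits" ]; then
    printf '%s\n' "$hits" | sed 's/^/AUDIT WARNING: /'
    echo "[AUDIT RESULT] FAILED"
    return 1
  fi
  echo "[AUDIT RESULT] PASSED - No sensitive data detected"
}
```

Each warning line carries the file path and line number from `grep -n`, which is what makes manual follow-up with `less` or an editor practical.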
These directories are automatically skipped to avoid interfering with development environments:
- Version Control: .git
- Dependencies: node_modules
- Python: __pycache__, .venv, venv, env, .env
- Build Artifacts: dist, build
- Test/Cache: .pytest_cache, .mypy_cache, .tox
These file types are automatically skipped:
- Python Bytecode: *.pyc, *.pyo
- Compiled Libraries: *.so, *.dylib, *.dll
- Executables: *.exe
- Object Files: *.o, *.a
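The exclusion rules above can be expressed as a single `find` invocation that prunes the listed directories and filters out the listed extensions. This is a sketch of one way to do it, with a hypothetical function name (`list_candidates`), and the script may build its file list differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print NUL-separated paths of files that would be
# candidates for cleaning, skipping excluded directories and file types.
list_candidates() {
  find "$1" \
    \( -type d \( -name .git -o -name node_modules -o -name __pycache__ \
       -o -name .venv -o -name venv -o -name env -o -name .env \
       -o -name dist -o -name build \
       -o -name .pytest_cache -o -name .mypy_cache -o -name .tox \) -prune \) \
    -o \( -type f \
       ! -name '*.pyc' ! -name '*.pyo' ! -name '*.so' ! -name '*.dylib' \
       ! -name '*.dll' ! -name '*.exe' ! -name '*.o' ! -name '*.a' -print0 \)
}

# Example: show the candidates one per line
#   list_candidates ./project | tr '\0' '\n'
```

NUL-separated output pairs naturally with `xargs -0 -P <n>`, which is the mechanism the command reference names for parallel mode, and it keeps paths with spaces intact.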
Standard Mode: .git directories are preserved and excluded from cleaning.
Sanitize Mode (--sanitize-git): When explicitly enabled, the script:
- Removes remote repository URLs
- Unsets local user.name and user.email
- Preserves commit history and branches
- Creates backup of git config
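In plain Git commands, the sanitize steps above look roughly like the following. This is a hedged sketch: `sanitize_git_config` and the `config.bak` backup name are hypothetical, and the script's own backup location is not documented. Commit history is untouched, so author names and emails recorded in existing commits remain in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: back up .git/config, remove all remotes, and unset
# the repository-local user identity.
sanitize_git_config() {
  local repo=$1 name
  [ -d "$repo/.git" ] || return 1

  cp "$repo/.git/config" "$repo/.git/config.bak" || return 1  # backup name assumed

  # Remove every configured remote (origin, upstream, ...)
  git -C "$repo" remote | while IFS= read -r name; do
    git -C "$repo" remote remove "$name"
  done

  # --unset-all exits non-zero when the key is absent, hence "|| true"
  git -C "$repo" config --local --unset-all user.name  || true
  git -C "$repo" config --local --unset-all user.email || true
}
```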
Problem: Script reports missing exiftool or python3
# Check dependencies
command -v exiftool && echo "exiftool: OK" || echo "exiftool: MISSING"
command -v python3 && echo "python3: OK" || echo "python3: MISSING"
command -v mat2 && echo "mat2: OK (optional)" || echo "mat2: MISSING (optional)"
# Install on Debian/Ubuntu
sudo apt install -y libimage-exiftool-perl python3 mat2
# Install on macOS
brew install exiftool python3
Problem: Cannot write to backup directory or target files
# Check write permissions
touch ./test_write && rm ./test_write || echo "No write permission"
# Check target file permissions
ls -la /path/to/file.pdf
# Fix permissions if needed
chmod u+w /path/to/file.pdf
Problem: Post-sanitization audit reports remaining sensitive data
Solution:
# Re-run with aggressive mode
./metadata_cleaner.sh -r -a --report audit.txt /path/to/files
# Check specific warnings
grep "AUDIT WARNING" audit.txt
# Manually inspect flagged files
less /path/to/flagged/file.txt
Note: False positives are common (e.g., filenames like password_generator.py or IP addresses in documentation).
Problem: Insufficient disk space or backup creation fails
# Check available disk space
df -h .
# Use custom backup location with more space
./metadata_cleaner.sh -b --backup-dir /mnt/external/backup /path/to/files
# Skip compression if running low on CPU
./metadata_cleaner.sh -b /path/to/files # without -c
Problem: Script won't run or shows syntax errors
# Verify script syntax
bash -n metadata_cleaner.sh
# Check bash version (requires 4.0+)
bash --version
# Run with verbose output for debugging
./metadata_cleaner.sh -v file.jpg
Problem: Parallel mode crashes or produces errors
Solution:
# Reduce number of parallel jobs
./metadata_cleaner.sh -r -p -j 2 /path/to/files
# Disable parallel processing
./metadata_cleaner.sh -r /path/to/files # without -p
# Check system resources
top
Enable verbose output for detailed troubleshooting:
# Verbose + dry-run to see exactly what will happen
./metadata_cleaner.sh -v -d -r /path/to/files
# Verbose + live mode for detailed execution log
./metadata_cleaner.sh -v -r /path/to/files 2>&1 | tee debug.log
If you encounter persistent issues:
- Run with -v (verbose) and save output: ./metadata_cleaner.sh -v ... 2>&1 | tee error.log
- Check the Contributing Guide for bug report guidelines
- Search existing issues on GitHub
- Create a new issue with:
  - Your OS and bash version
  - Complete error messages
  - Steps to reproduce
  - Output from verbose mode
Contributions are welcome! Please read CONTRIBUTING.md for:
- How to report bugs and security issues
- How to submit feature requests
- Pull request process and guidelines
- Code standards and style guide
- Testing requirements
# Clone repository
git clone https://github.com/[username]/bash-scripts.git
cd bash-scripts
# Make changes to metadata_cleaner.sh
# Test with dry-run
./metadata_cleaner.sh -d -v -r test_directory/
# Validate syntax
bash -n metadata_cleaner.sh
# Run comprehensive test
./metadata_cleaner.sh -r -b -a --report test_report.txt test_directory/
This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
You are free to:
- ✅ Share — copy and redistribute the material
- ✅ Adapt — remix, transform, and build upon the material
Under the following terms:
- 📝 Attribution — You must give appropriate credit
- 🚫 NonCommercial — You may not use for commercial purposes
- 🔄 ShareAlike — Distribute under the same license
See the LICENSE file for complete details.
Built with:
- ExifTool by Phil Harvey - Comprehensive metadata manipulation
- MAT2 - Metadata removal toolkit
- Bash scripting best practices from the community
ulpati
- GitHub: @ulpati
- Repository: bash-scripts
Privacy matters. Clean your metadata! 🛡️
Last updated: December 2025