🛡️ Metadata Cleaner

Professional file sanitization tool for privacy-conscious users

🎯 Overview

Metadata Cleaner is a file sanitization tool that strips identifying metadata from files and directories before you share them on any platform. Use it to protect your privacy when uploading to cloud storage, sharing files, publishing datasets, or collaborating online.

Why Metadata Cleaner?

  • Universal Privacy Protection: Works with any file sharing platform or cloud storage
  • Comprehensive Cleaning: Removes EXIF, ID3, document metadata, execution history, and more
  • Post-Sanitization Audit: Automatic verification ensures no sensitive data remains
  • Production Ready: Robust error handling, integrity checks, and detailed reporting
  • Cross-Platform: Works on Linux and macOS with portable shell scripting

✨ Features

Core Capabilities

  • 🎯 Comprehensive Metadata Removal

    • EXIF data from images (GPS, camera info, timestamps)
    • ID3 tags from audio files (artist, album, year)
    • Document metadata (author, software, creation dates)
    • Jupyter notebook execution history and outputs
    • Filesystem extended attributes
    • Author/date comments in text files
  • 🔒 Aggressive Sensitive Data Protection

    • Detects and removes passwords, secrets, API keys, tokens
    • Sanitizes email addresses → [EMAIL_REMOVED]
    • Neutralizes absolute paths: /home/username/ → /home/user/
    • Masks private IPs: 192.168.x.x → XXX.XXX.XXX.XXX
    • Cleans version control metadata (Git)
  • 🔍 Automated Post-Sanitization Audit

    • Verifies no sensitive data remains after cleaning
    • Scans for patterns, emails, paths, and private IPs
    • Generates detailed warnings with file locations
    • Confirms files are safe for public sharing
  • 💾 Smart Backup & Verification

    • Optional compressed backups (tar.gz)
    • SHA256 checksum generation for integrity verification
    • Timestamped backups enable multiple restore points
    • File integrity checks (size, format validation, JSON parsing)
  • ⚡ Performance & Reliability

    • Parallel processing support for large directories
    • Cross-platform compatibility (Linux and macOS)
    • Dry-run mode to preview changes safely
    • Detailed reporting and statistics

🚀 Installation

Prerequisites

Required:

# Debian/Ubuntu
sudo apt install -y libimage-exiftool-perl python3

# Fedora/RHEL
sudo dnf install -y perl-Image-ExifTool python3

# macOS (Homebrew)
brew install exiftool python3

Optional (Recommended):

# mat2 provides enhanced cleaning for documents and archives
# Debian/Ubuntu
sudo apt install -y mat2

# macOS
pip3 install mat2

Setup

# Clone repository
git clone https://github.com/[username]/bash-scripts.git
cd bash-scripts

# Make executable
chmod +x metadata_cleaner.sh

# Verify installation
./metadata_cleaner.sh --help

# Optional: Install to system path
sudo cp metadata_cleaner.sh /usr/local/bin/metadata-cleaner

Quick Download (Standalone)

# Download script directly
wget https://raw.githubusercontent.com/[username]/bash-scripts/main/metadata_cleaner.sh

# Or using curl
curl -O https://raw.githubusercontent.com/[username]/bash-scripts/main/metadata_cleaner.sh

# Make executable
chmod +x metadata_cleaner.sh

💻 Usage

Quick Start

# Clean a single file
./metadata_cleaner.sh document.pdf

# Clean directory recursively
./metadata_cleaner.sh -r /path/to/directory

# Preview changes (dry-run)
./metadata_cleaner.sh -d -r /path/to/directory

# Full sanitization with backup
./metadata_cleaner.sh -r -b -a /path/to/files

For detailed usage, run:

./metadata_cleaner.sh --help

📖 Command Reference

Options

-h, --help             Show help message
-v, --verbose          Enable verbose output
-d, --dry-run          Preview changes without modifying files
-r, --recursive        Process directories recursively
-b, --backup           Create backup before cleaning
--backup-dir <path>    Specify custom backup directory (default: ./metadata_backup)
-c, --compress         Compress backup
-a, --aggressive       Remove sensitive data patterns (aggressive mode)
--sanitize-git         Clean version control metadata (remove remotes, unset local user info)
--checksums            Generate SHA256 hashes (added to report when enabled)
--report <file>        Generate detailed report to specified file
-p, --parallel         Enable parallel processing (uses xargs -P)
-j, --jobs <n>         Number of parallel jobs (default: 4)

📖 Examples

Common Workflows

1. Preview Before Cleaning (Recommended)

# Always preview first to see what will be changed
./metadata_cleaner.sh -d -r /path/to/share

2. Safe File Sharing Preparation

# Create backup + aggressive cleaning + verification
./metadata_cleaner.sh -r -b -a /path/to/share

3. Single Document Sanitization

# Clean presentation with backup
./metadata_cleaner.sh -b -a presentation.pptx

4. Photo Album Cleaning

# Preview first, then clean
./metadata_cleaner.sh -d -r photos/
./metadata_cleaner.sh -r -b -a photos/

5. Academic Dataset Preparation

# Comprehensive cleaning with full audit trail
./metadata_cleaner.sh -r -b \
  --backup-dir ./backups/dataset \
  -c --checksums \
  --report dataset_audit.txt \
  -a research_data/

6. Complete Project Sanitization

# Maximum safety: backup + compress + aggressive + git sanitization + report
./metadata_cleaner.sh -r -b -c -a \
  --sanitize-git \
  --checksums \
  --report full_audit.txt \
  /path/to/project

7. Parallel Processing for Large Directories

# Use 8 parallel jobs for faster processing
./metadata_cleaner.sh -r -p -j 8 -b -a /large/directory
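The option reference says that -p relies on xargs -P. As a rough illustration of that pattern (not the script's actual internals), the sketch below fans a per-file command out over a directory tree. The helper name run_parallel and its argument order are hypothetical.

```shell
# Hypothetical helper: run_parallel <dir> <jobs> <command> [args...]
# Runs <command> once per regular file under <dir>, <jobs> at a time.
run_parallel() {
  dir=$1
  jobs=$2
  shift 2
  # -print0 / -0 keep filenames with spaces or newlines intact.
  # -n 1 hands each invocation exactly one file; -P caps concurrency.
  find "$dir" -type f -print0 | xargs -0 -n 1 -P "$jobs" "$@"
}
```

For example, `run_parallel photos/ 8 exiftool -all= -overwrite_original` would strip tags from eight files at a time. Output from concurrent jobs can arrive in any order.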

8. Version Control Repository Cleaning

# Remove Git metadata and user information
./metadata_cleaner.sh -r -a --sanitize-git repo/

Important Notes

  • Always test with -d (dry-run) first to preview changes
  • Use -b (backup) before cleaning important files
  • Combine -c with -b for compressed backups (saves space)
  • Use -a (aggressive) to remove sensitive data patterns from file contents
  • Quote paths with spaces: "./path with spaces/"
  • Dry-run skips audit: The post-sanitization audit only runs in live mode
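For readers who want a restore point without running the script, a compressed, timestamped backup similar to what -b -c describes can be approximated by hand. This is a minimal sketch: the function name backup_tree and the archive naming scheme are assumptions, and only the default directory ./metadata_backup comes from the option reference.

```shell
# Hypothetical helper: backup_tree <source> [backup-dir]
# Writes <backup-dir>/backup_<timestamp>.tar.gz and prints its path.
backup_tree() {
  src=$1
  dest=${2:-./metadata_backup}
  stamp=$(date +%Y%m%d_%H%M%S)
  archive="$dest/backup_$stamp.tar.gz"   # naming scheme is an assumption
  mkdir -p "$dest" || return 1
  # -C switches to the parent directory so the archive stores relative paths.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  printf '%s\n' "$archive"
}
```

Keep the backup directory outside the tree being archived, otherwise tar may try to include the archive in itself.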

🔍 What Gets Cleaned

Supported File Types

Images

Formats: JPEG, PNG, TIFF, GIF, BMP, WebP

Removed Metadata:

  • GPS coordinates and location data
  • Camera make, model, and settings
  • Software and editing information
  • Creation and modification timestamps
  • Author and copyright information

Audio/Video

Formats: MP3, MP4, AVI, MOV, WAV

Removed Metadata:

  • Artist, album, and track information
  • Recording location and timestamps
  • Encoding software details
  • Comments and descriptions

Documents

Formats: PDF, DOCX, XLSX, PPTX, ODT

Removed Metadata:

  • Author and organization information
  • Creation and modification software
  • Document statistics (edit time, revisions)
  • Template information
  • Hidden text and comments

Jupyter Notebooks

Format: .ipynb

Removed Metadata:

  • Cell execution counts
  • Cell outputs and results
  • Kernel information and session data
  • Notebook-level metadata
  • User-specific paths
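The notebook items above map onto a handful of JSON fields. The sketch below shows one way to clear them with python3 (already a prerequisite). It is an illustration under that assumption, not the script's own code, and the function name strip_notebook is hypothetical.

```shell
# Hypothetical helper: strip_notebook <file.ipynb>
# Rewrites the notebook in place without outputs, execution counts, or metadata.
strip_notebook() {
  python3 - "$1" <<'PY'
import json
import sys

path = sys.argv[1]
with open(path, encoding="utf-8") as fh:
    nb = json.load(fh)

for cell in nb.get("cells", []):
    cell["metadata"] = {}                 # per-cell metadata
    if cell.get("cell_type") == "code":
        cell["outputs"] = []              # results, which may embed paths
        cell["execution_count"] = None    # execution history

nb["metadata"] = {}                       # kernel info and notebook-level data

with open(path, "w", encoding="utf-8") as fh:
    json.dump(nb, fh, indent=1)
    fh.write("\n")
PY
}
```

Cell sources are left untouched, so paths typed into the code itself still need the aggressive-mode pass.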

Text Files

Formats: Markdown, Python, SQL, CSV, JSON, YAML, Shell scripts

Removed Metadata:

  • Author comments (# Author: ...)
  • Date comments (# Date: ..., # Created: ...)
  • Modification history comments
  • Sensitive data patterns (aggressive mode)
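As an illustration of the comment removal listed above, the following sketch drops lines that consist of an Author, Date, or Created comment. The exact patterns the script uses are not documented here, so the regular expression and the function name strip_header_comments are assumptions.

```shell
# Hypothetical filter: reads text on stdin, writes it to stdout without
# "# Author:", "# Date:", or "# Created:" comment lines (case-sensitive).
strip_header_comments() {
  sed -E '/^[[:space:]]*#[[:space:]]*(Author|Date|Created)[[:space:]]*:/d'
}
```

Usage: `strip_header_comments < script.py > script.clean.py`. Only `#`-style comments are covered; SQL (`--`) or other comment syntaxes would need their own patterns.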

All Files

Universal Cleaning:

  • Filesystem timestamps → set to 2000-01-01
  • Extended attributes (xattrs)
  • File comments and annotations
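The timestamp reset can be reproduced with the standard touch utility. This is a minimal sketch, assuming midnight local time on 2000-01-01 (the document only gives the date); the function name reset_timestamps is hypothetical.

```shell
# Hypothetical helper: reset_timestamps <file>...
# Sets access and modification times to 2000-01-01 00:00:00 local time.
reset_timestamps() {
  for f in "$@"; do
    # touch -t takes [[CC]YY]MMDDhhmm[.SS]
    touch -t 200001010000.00 -- "$f" || return 1
  done
}
```

Extended attributes are not touched by this sketch; the tools for clearing them differ between Linux and macOS.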

Cleaning Modes

Standard Mode

Removes metadata from file headers and properties using exiftool and, when it is installed, mat2.

Aggressive Mode (-a)

Additionally scans and removes:

  • Email addresses → [EMAIL_REMOVED]
  • Absolute paths → /home/user/
  • Private IP addresses → XXX.XXX.XXX.XXX
  • Sensitive keywords: password, secret, api_key, token, credential
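The first three replacements above can be sketched as a sed filter. The regular expressions are assumptions chosen for illustration (the script's real patterns may be broader or narrower), and the function name sanitize_stream is hypothetical. Keyword handling is left out because the document does not say how matches are rewritten.

```shell
# Hypothetical filter: reads text on stdin and writes the masked text to stdout.
# Rule 1: email addresses          -> [EMAIL_REMOVED]
# Rule 2: /home/<name>/ prefixes   -> /home/user/
# Rule 3: RFC 1918 private IPv4    -> XXX.XXX.XXX.XXX
sanitize_stream() {
  sed -E \
    -e 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/[EMAIL_REMOVED]/g' \
    -e 's#/home/[^/[:space:]]+/#/home/user/#g' \
    -e 's/(^|[^0-9.])(10\.[0-9]{1,3}|192\.168|172\.(1[6-9]|2[0-9]|3[01]))\.[0-9]{1,3}\.[0-9]{1,3}/\1XXX.XXX.XXX.XXX/g'
}
```

Usage: `sanitize_stream < notes.txt > notes.clean.txt`. Public addresses such as 8.8.8.8 pass through unchanged.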

Important Limitations

⚠️ Complex Document Metadata: Revision history and tracked changes in DOCX files may require manual removal or re-export.

⚠️ Embedded Content: Some PDFs contain embedded metadata in images or forms that may not be fully removed.

⚠️ Permissions: File permissions and ACLs are preserved and not modified by the script.

⚠️ Binary Files: Executable files and compiled binaries are automatically excluded from cleaning.

Best Practices

  1. Always preview first: Use -d (dry-run) to see what will be changed
  2. Create backups: Use -b before cleaning important files
  3. Verify results: Use exiftool or grep to manually verify critical files
  4. Use aggressive mode: Add -a for maximum privacy protection
  5. Review audit report: Check warnings for any remaining sensitive data

📁 Project Structure

bash-scripts/
├── README.md                      # Main repository documentation
├── LICENSE                        # CC BY-NC-SA 4.0 license
├── CONTRIBUTING.md                # Contribution guidelines
├── metadata_cleaner.sh            # Main sanitization script
├── metadata_cleaner.md            # This documentation
├── publish_repo.sh                # Repository publishing script
├── publish_repo.md                # Publishing documentation
└── life_game/                     # Conway's Game of Life implementation
    ├── life_game.sh
    ├── install.sh
    ├── README.md
    ├── modules/
    └── data/

📊 Reports & Verification

Detailed Audit Reports

Generate comprehensive reports using the --report flag:

./metadata_cleaner.sh -r -a --report audit.txt /path/to/files

Report Contents

Summary Statistics:

  • Total files processed
  • Successfully cleaned files
  • Skipped files (with reasons)
  • Failed files (with errors)
  • Sensitive data items removed
  • Audit warnings and their severity
  • Data volume processed
  • Processing time

File-Level Details:

  • SHA256 checksums (when --checksums enabled)
  • Individual file processing status
  • Specific issues detected per file
  • Post-sanitization audit results

Audit Warnings:

  • Remaining sensitive patterns with file locations
  • Email addresses found
  • Absolute paths detected
  • Private IP addresses

Example Report Output

========================================
Metadata Cleaner Report
========================================
Date: 2025-12-05 14:30:15
Target: /home/user/documents
Mode: LIVE
Recursive: true
Backup: true
Aggressive: true
========================================

=== FILE CHECKSUMS (SHA256) ===
SHA256: a3f5d8e... | /home/user/documents/file1.pdf
SHA256: b7c9e2f... | /home/user/documents/file2.docx
...

========================================
Final Statistics:
========================================
Total files found:     657
Successfully cleaned:  621
Skipped:              36 (excluded/binary)
Failed:               0
Sensitive data items:  39
Audit warnings:       0 (PASSED)
Data processed:       ~100 MB
Time elapsed:         0m 3s
========================================

=== POST-SANITIZATION AUDIT ===
[AUDIT RESULT] PASSED - No sensitive data detected
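To cross-check a checksum from the report by hand, the line format shown above can be reproduced with standard tools. This is a sketch only: the function names are hypothetical, and it prints the full digest rather than the shortened form used in the example.

```shell
# Hypothetical helpers for recomputing a report-style checksum line.
sha256_of() {
  # Reading from stdin keeps the filename out of the tool's output.
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum < "$1" | awk '{print $1}'      # GNU coreutils (Linux)
  else
    shasum -a 256 < "$1" | awk '{print $1}'  # ships with macOS
  fi
}

checksum_line() {
  printf 'SHA256: %s | %s\n' "$(sha256_of "$1")" "$1"
}
```

Running `checksum_line file1.pdf` before and after a restore gives two lines that can be compared directly.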

Verification Commands

# Generate and review audit report
./metadata_cleaner.sh -r -a --report audit.txt --checksums files/

# Check for remaining issues
grep "AUDIT WARNING" audit.txt

# Verify checksums
grep "SHA256:" audit.txt

# Manual verification with exiftool
exiftool cleaned_file.jpg

# Search for specific patterns
grep -rE "password|secret" cleaned_directory/

Audit Process

The post-sanitization audit automatically:

  1. Scans all cleaned files for sensitive patterns
  2. Checks for email addresses
  3. Detects absolute paths with usernames
  4. Identifies private IP addresses
  5. Verifies file integrity
  6. Generates warnings with specific file locations
  7. Provides PASS/FAIL status
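The scanning steps above can be approximated with a recursive grep. The sketch below is an assumption-laden stand-in for the real audit: the patterns, the warning line format, and the FAILED message are illustrative, and only the PASSED line is taken from the example report. The function name audit_dir is hypothetical.

```shell
# Hypothetical helper: audit_dir <dir>
# Prints one warning per matching line and returns 1 if anything was found.
audit_dir() {
  dir=$1
  # Patterns, in order: email addresses, private IPv4 addresses, keywords.
  hits=$(grep -rEn \
    -e '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' \
    -e '(^|[^0-9.])(10\.[0-9]{1,3}|192\.168|172\.(1[6-9]|2[0-9]|3[01]))\.[0-9]{1,3}\.[0-9]{1,3}' \
    -e 'password|secret|api_key|token|credential' \
    -- "$dir" || true)
  if [ -n "$hits" ]; then
    # grep -n output is file:line:text, which gives the file location.
    printf '%s\n' "$hits" | sed 's/^/[AUDIT WARNING] /'
    echo "[AUDIT RESULT] FAILED"
    return 1
  fi
  echo "[AUDIT RESULT] PASSED - No sensitive data detected"
}
```

Path detection and the integrity check are omitted here; a placeholder such as /home/user/ would otherwise be flagged as a false positive.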

🚫 Automatic Exclusions

Excluded Directories

These directories are automatically skipped to avoid interfering with development environments:

  • Version Control: .git
  • Dependencies: node_modules
  • Python: __pycache__, .venv, venv, env, .env
  • Build Artifacts: dist, build
  • Test/Cache: .pytest_cache, .mypy_cache, .tox

Excluded File Types

These file types are automatically skipped:

  • Python Bytecode: *.pyc, *.pyo
  • Compiled Libraries: *.so, *.dylib, *.dll
  • Executables: *.exe
  • Object Files: *.o, *.a
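To preview which files survive these exclusion rules, the two lists above can be expressed as a single find command. This is a sketch of the idea rather than the script's own traversal, and the function name list_candidates is hypothetical.

```shell
# Hypothetical helper: list_candidates <dir>
# Prints regular files that are not inside an excluded directory
# and do not have an excluded extension.
list_candidates() {
  find "$1" \
    \( -type d \( -name .git -o -name node_modules -o -name __pycache__ \
       -o -name .venv -o -name venv -o -name env -o -name .env \
       -o -name dist -o -name build \
       -o -name .pytest_cache -o -name .mypy_cache -o -name .tox \) -prune \) \
    -o \( -type f \
       ! -name '*.pyc' ! -name '*.pyo' \
       ! -name '*.so' ! -name '*.dylib' ! -name '*.dll' \
       ! -name '*.exe' ! -name '*.o' ! -name '*.a' -print \)
}
```

Using -prune stops find from descending into excluded directories at all, which matters for large trees such as node_modules.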

Git Repository Handling

Standard Mode: .git directories are preserved and excluded from cleaning.

Sanitize Mode (--sanitize-git): When explicitly enabled, the script:

  • Removes remote repository URLs
  • Unsets local user.name and user.email
  • Preserves commit history and branches
  • Creates backup of git config
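The four steps above correspond to a few ordinary git commands. The sketch below shows the equivalent manual procedure. The function name, the backup file name git_config.bak, and the choice to store the backup outside the repository are assumptions.

```shell
# Hypothetical helper: sanitize_git_config <repo> <backup-dir>
sanitize_git_config() {
  repo=$1
  backup_dir=$2
  mkdir -p "$backup_dir" || return 1
  # Back up first; the copy still contains the remote URLs and identity.
  cp "$repo/.git/config" "$backup_dir/git_config.bak" || return 1
  # Remove every configured remote.
  git -C "$repo" remote | while IFS= read -r name; do
    git -C "$repo" remote remove "$name"
  done
  # --unset-all exits non-zero when the key is absent, which is fine here.
  git -C "$repo" config --local --unset-all user.name || true
  git -C "$repo" config --local --unset-all user.email || true
}
```

Commit history is untouched, so author names and emails recorded in existing commits remain visible in `git log`.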

🔧 Troubleshooting

Common Issues

1. Missing Dependencies

Problem: Script reports missing exiftool or python3

# Check dependencies
command -v exiftool && echo "exiftool: OK" || echo "exiftool: MISSING"
command -v python3 && echo "python3: OK" || echo "python3: MISSING"
command -v mat2 && echo "mat2: OK (optional)" || echo "mat2: MISSING (optional)"

# Install on Debian/Ubuntu
sudo apt install -y libimage-exiftool-perl python3 mat2

# Install on macOS
brew install exiftool python3

2. Permission Errors

Problem: Cannot write to backup directory or target files

# Check write permissions
touch ./test_write && rm ./test_write || echo "No write permission"

# Check target file permissions
ls -la /path/to/file.pdf

# Fix permissions if needed
chmod u+w /path/to/file.pdf

3. Audit Warnings

Problem: Post-sanitization audit reports remaining sensitive data

Solution:

# Re-run with aggressive mode
./metadata_cleaner.sh -r -a --report audit.txt /path/to/files

# Check specific warnings
grep "AUDIT WARNING" audit.txt

# Manually inspect flagged files
less /path/to/flagged/file.txt

Note: False positives are common (e.g., filenames like password_generator.py or IP addresses in documentation).

4. Backup Fails

Problem: Insufficient disk space or backup creation fails

# Check available disk space
df -h .

# Use custom backup location with more space
./metadata_cleaner.sh -b --backup-dir /mnt/external/backup /path/to/files

# Skip compression if CPU is the bottleneck (uncompressed backups need more disk space)
./metadata_cleaner.sh -b /path/to/files  # without -c

5. Script Syntax Errors

Problem: Script won't run or shows syntax errors

# Verify script syntax
bash -n metadata_cleaner.sh

# Check bash version (requires 4.0+)
bash --version

# Run with verbose output for debugging
./metadata_cleaner.sh -v file.jpg

6. Parallel Processing Issues

Problem: Parallel mode crashes or produces errors

Solution:

# Reduce number of parallel jobs
./metadata_cleaner.sh -r -p -j 2 /path/to/files

# Disable parallel processing
./metadata_cleaner.sh -r /path/to/files  # without -p

# Check system resources
top

Debug Mode

Enable verbose output for detailed troubleshooting:

# Verbose + dry-run to see exactly what will happen
./metadata_cleaner.sh -v -d -r /path/to/files

# Verbose + live mode for detailed execution log
./metadata_cleaner.sh -v -r /path/to/files 2>&1 | tee debug.log

Getting Help

If you encounter persistent issues:

  1. Run with -v (verbose) and save output: ./metadata_cleaner.sh -v ... 2>&1 | tee error.log
  2. Check the Contributing Guide for bug report guidelines
  3. Search existing issues on GitHub
  4. Create a new issue with:
    • Your OS and bash version
    • Complete error messages
    • Steps to reproduce
    • Output from verbose mode

🤝 Contributing

Contributions are welcome! Please read CONTRIBUTING.md for:

  • How to report bugs and security issues
  • How to submit feature requests
  • Pull request process and guidelines
  • Code standards and style guide
  • Testing requirements

Development

# Clone repository
git clone https://github.com/[username]/bash-scripts.git
cd bash-scripts

# Make changes to metadata_cleaner.sh

# Test with dry-run
./metadata_cleaner.sh -d -v -r test_directory/

# Validate syntax
bash -n metadata_cleaner.sh

# Run comprehensive test
./metadata_cleaner.sh -r -b -a --report test_report.txt test_directory/

📄 License

This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

You are free to:

  • ✅ Share — copy and redistribute the material
  • ✅ Adapt — remix, transform, and build upon the material

Under the following terms:

  • 📝 Attribution — You must give appropriate credit
  • 🚫 NonCommercial — You may not use for commercial purposes
  • 🔄 ShareAlike — Distribute under the same license

See the LICENSE file for complete details.

🙏 Acknowledgements

Built with:

  • ExifTool by Phil Harvey - Comprehensive metadata manipulation
  • MAT2 - Metadata removal toolkit
  • Bash scripting best practices from the community

👤 Author

ulpati


Privacy matters. Clean your metadata! 🛡️

Last updated: December 2025