Python-GUI-Duplicate-File-Finder

Ragilmalik’s Duplicate File Finder

A fast, no‑nonsense Windows GUI for finding duplicate files in a folder (non‑recursive by default). It hashes files with SHA‑256, moves duplicates to a Duplicates/ folder next to the source, and writes a clean Excel log you can hand to anyone.

Program name in UI: Ragilmalik’s Duplicate File Finder


Highlights

  • One-click scan. Point it at a folder and go. Default is current folder only; turn on Recursive if you want subfolders.
  • Solid matching. Files are matched by SHA‑256. File size/mtime are only used for cache validation, not for matching.
  • Smart “original” pick. If multiple files hash the same:
    • Prefer a name without a trailing suffix like ~(n), -(n), (n), or _n.
    • Otherwise choose the one with the smallest suffix number.
  • Safe by default. Duplicates are moved (not deleted) into Duplicates/ inside the source folder.
  • Excel log (XLSX). Columns: Folder, Date and Time (DD-MM-YYYY HH-MM-SS), File Size, Filename, Filename 2, … up to the max duplicates found in that run.
  • Screen Log. Live table shows Filename + Duplicates found. Columns are resizable and fill the available width.
  • ETA and progress. ETA in HH:MM:SS. Progress bar fades from red → green as it completes.
  • Hash cache. SQLite cache keyed by path/size/mtime; unchanged files are never re-hashed across runs.
  • Simulation mode. “Simulation only” runs show what would be moved without touching the filesystem.
  • Rollback. Generate a CSV of moves and restore later with Execute Rollback.
  • Themes. Dark (pure black) and Light (pure white). Cyan outlines on Dark, blue on Light. Dropdowns open by clicking anywhere.
  • Multithreaded hashing. Uses half of your logical cores by default (keeps the system responsive).
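
The multithreaded hashing described above can be sketched in a few lines. This is a minimal standalone illustration (the function names are ours, not the app's actual code): chunked SHA‑256 so large files never load fully into memory, fanned out over half the logical cores.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files are streamed, not loaded."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_files(paths):
    """Hash files in parallel using half of the logical cores."""
    workers = max(1, (os.cpu_count() or 2) // 2)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its digest
        return dict(zip(paths, pool.map(sha256_of, paths)))
```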

Tested on Windows 10/11 with Python 3.9+.


Install

Global install (no virtualenv). If you prefer virtualenv/conda, it works the same.

python -m pip install --upgrade pip
python -m pip install PySide6 openpyxl pyinstaller

Run from source

python "ragilmalik_duplicate_file_finder.py"

Quick start

  1. Source Folder → choose a folder (defaults to the current working directory when launched).
  2. Log Folder Location → keep Default (Same folder as source folder) or pick a custom folder.
  3. Filter Mode → Include only or Exclude.
  4. File type box → leave * (all files) or set extensions like .jpg, .png, .zip (comma or semicolon separated).
  5. (Optional) Recursive → include subfolders.
  6. Click Run (or Simulation only to dry-run).
  7. Watch the Screen Log and ETA. Click Stop any time.
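
The extension filtering in steps 3–4 might look roughly like this. A hedged sketch (names are ours): comma/semicolon-separated extensions are normalized, `*` means all files, and Include/Exclude flips the match.

```python
from pathlib import Path

def parse_extensions(text: str) -> set:
    """Split a '.jpg, .png; .zip' style string into normalized extensions.

    '*' (all files) yields an empty set, meaning 'no filter'.
    """
    parts = text.replace(";", ",").split(",")
    exts = {p.strip().lower() for p in parts if p.strip() and p.strip() != "*"}
    return {e if e.startswith(".") else "." + e for e in exts}

def matches(filename: str, exts: set, include: bool) -> bool:
    """Include-only keeps matching extensions; Exclude drops them."""
    if not exts:
        return True  # no filter: every file matches
    hit = Path(filename).suffix.lower() in exts
    return hit if include else not hit
```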

Results:

  • Duplicates are moved to Source\Duplicates\ (or just listed in Simulation).
  • An Excel log is saved as Duplicates_Log_REAL_*.xlsx (or _SIM_*.xlsx).

Build a single .exe (PyInstaller)

pyinstaller --onefile --noconsole --clean --name "Ragilmalik_Duplicate_File_Finder" "ragilmalik_duplicate_file_finder.py"

The built EXE will be in dist\Ragilmalik_Duplicate_File_Finder.exe.

Tip: If you distribute the EXE, zip it together with a short README and a sample XLSX to reduce SmartScreen warnings.


How duplicates are chosen

  1. Compute SHA‑256 for every file (multithreaded).
  2. Group files by hash.
  3. Pick the “original” in this order:
    • A filename without a trailing number wins (no ~(2), -(3), (4), _5, etc.).
    • Otherwise, the smallest suffix number wins.
    • Ties resolve lexicographically.
  4. Every other file in the group is a duplicate and gets moved into Duplicates/ (unless Simulation is on).

The app never deletes files. It only moves them.
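
The original-pick rules above can be sketched as a sort key. This is our own illustration of the described ordering (clean name first, then smallest suffix number, then lexicographic), not the app's actual code:

```python
import re
from pathlib import Path
from typing import Optional

# Matches trailing copy markers such as "photo~(2)", "photo-(3)",
# "photo (4)" / "photo(4)", or "photo_5" just before the extension.
SUFFIX_RE = re.compile(r"[ ]?(?:[~-]?\((\d+)\)|_(\d+))$")

def suffix_number(path: Path) -> Optional[int]:
    """Return the trailing copy number, or None for a clean name."""
    m = SUFFIX_RE.search(path.stem)
    if not m:
        return None
    return int(m.group(1) or m.group(2))

def pick_original(paths):
    """Clean names win; else smallest suffix number; ties break lexicographically."""
    def key(p: Path):
        n = suffix_number(p)
        # False sorts before True, so suffix-free names always come first
        return (n is not None, n if n is not None else 0, str(p))
    return min(paths, key=key)
```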


Log output (XLSX)

  • Sheet: Duplicates Log
  • Columns:
    • Folder (full path to the original’s parent)
    • Date and Time (DD-MM-YYYY HH-MM-SS)
    • File Size (kB/MB, base 1024)
    • Filename (original)
    • Filename 2, Filename 3, … one column per duplicate, up to the max found that run.
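
A sketch of how these rows could be assembled (our own helper, under the assumption that each group carries the original, its size, and its duplicates); each returned row would then go to an openpyxl worksheet via `ws.append(row)`:

```python
from datetime import datetime
from pathlib import Path

def build_log_rows(groups, stamp=None):
    """Build header + data rows for the 'Duplicates Log' sheet.

    groups: list of (original: Path, size_bytes: int, duplicates: list[Path]).
    The Filename columns are sized to the largest duplicate group in the run.
    """
    stamp = stamp or datetime.now().strftime("%d-%m-%Y %H-%M-%S")
    max_dupes = max((len(d) for _, _, d in groups), default=0)
    header = ["Folder", "Date and Time", "File Size", "Filename"]
    header += ["Filename %d" % i for i in range(2, max_dupes + 2)]
    rows = [header]
    for original, size, dupes in groups:
        # Base-1024 sizing, matching the kB/MB note above
        size_txt = ("%.1f kB" % (size / 1024) if size < 1024 ** 2
                    else "%.1f MB" % (size / 1024 ** 2))
        rows.append([str(original.parent), stamp, size_txt, original.name]
                    + [d.name for d in dupes])
    return rows
```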

Rollback

  • Generate Rollback creates a CSV with columns: original_path, moved_path for the last real run.
  • Execute Rollback reads that CSV and moves files back. If the original name is already taken, the file is restored with a (restoredN) marker appended to the filename.

Keep rollback CSVs in the same folder as the log for easy bookkeeping.
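
The restore step might look like the following sketch (our own function name; assumes the CSV columns named above). It moves each file back and probes for a free `(restoredN)` name instead of overwriting:

```python
import csv
import shutil
from pathlib import Path

def execute_rollback(csv_path):
    """Read a rollback CSV (original_path, moved_path) and move files back.

    If the original name is already taken, append a (restoredN) marker
    before the extension rather than overwriting the existing file.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            original = Path(row["original_path"])
            moved = Path(row["moved_path"])
            target, n = original, 1
            while target.exists():
                target = original.with_name(
                    "%s (restored%d)%s" % (original.stem, n, original.suffix))
                n += 1
            shutil.move(str(moved), str(target))
```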


Performance notes

  • Threads: uses (CPU logical cores) / 2 to keep the UI responsive.
  • Hash cache: file path + size + mtime are cached; unchanged files are never re-hashed. Cache lives at:
    Source\.ragilmalik_hash_cache.sqlite3
  • Disks: hashing is I/O bound. NVMe helps. Consider excluding huge media/archives when doing exploratory scans.
  • Antivirus: large parallel reads can trigger AV scanning. If scans are slow, add a temporary AV exclusion for the source during testing.
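
The cache lookup described above amounts to a path/size/mtime check before hashing. A minimal sketch with the standard-library `sqlite3` module (our own schema and names, not the app's actual cache layout):

```python
import sqlite3
from pathlib import Path

def open_cache(db_path):
    """Open (or create) the SQLite hash cache."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS hashes
                   (path TEXT PRIMARY KEY, size INTEGER, mtime REAL, sha256 TEXT)""")
    return con

def cached_hash(con, path: Path, compute):
    """Return the cached digest if size and mtime are unchanged, else recompute."""
    st = path.stat()
    row = con.execute("SELECT size, mtime, sha256 FROM hashes WHERE path = ?",
                      (str(path),)).fetchone()
    if row and row[0] == st.st_size and row[1] == st.st_mtime:
        return row[2]  # cache hit: skip re-hashing
    digest = compute(path)
    con.execute("INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
                (str(path), st.st_size, st.st_mtime, digest))
    con.commit()
    return digest
```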

Theme & UI details

  • Dark theme: black background, white text, cyan outlines.
  • Light theme: white background, black text, blue outlines.
  • Group frames: subtle 1 px grey outer border; controls keep the cyan/blue outline.
  • Dropdowns: click anywhere on the box to open; Light theme uses a white arrow; popup list matches theme with readable text.
  • Progress bar: text always centered; color transitions red → green.
  • Screen Log: header gets a bottom border in the outline color; a vertical divider on the right of Filename. Rows/columns have a fine 1 px grid.

Troubleshooting

  • Nothing happens on Run → Pick a Source Folder first.
  • Missing dependency → pip install PySide6 openpyxl
  • “Failed to execute script” after building → run the script directly to see the real traceback.
  • GUI looks cramped → on high DPI, try Windows display scaling 125–150% or increase app font size in Windows settings.
  • Very large folders → try Simulation only first, then narrow with extensions, then run for real.

Development

The code is a single-file GUI built on PySide6, with a worker thread for hashing (plus a thread pool for parallel reads). The UI has a few quality-of-life touches (centered combo text, themed popup lists, header/column sizing helpers).

If you want to add a new export format or a “delete duplicates” action, wire it behind a confirmation step and keep the rollback feature intact.


License

Choose what fits your project. If you’re unsure, MIT is a good default.


Credits


Built for fast day‑to‑day duplicate cleanup with clear logs, safe moves, and easy rollback.
