A fast, no‑nonsense Windows GUI for finding duplicate files in a folder (non‑recursive by default). It hashes files with SHA‑256, moves duplicates to a Duplicates/ folder next to the source, and writes a clean Excel log you can hand to anyone.
Program name in UI: Ragilmalik’s Duplicate File Finder
- One-click scan. Point it at a folder and go. Default is current folder only; turn on Recursive if you want subfolders.
- Solid matching. Files are matched by SHA‑256. File size/mtime are only used for cache validation, not for matching.
- Smart “original” pick. If multiple files hash the same:
  - Prefer a name without a trailing suffix like `~(n)`, `-(n)`, `(n)`, or `_n`.
  - Otherwise choose the one with the smallest suffix number.
- Safe by default. Duplicates are moved (not deleted) into `Duplicates/` inside the source folder.
- Excel log (XLSX). Columns: `Folder`, `Date and Time (DD-MM-YYYY HH-MM-SS)`, `File Size`, `Filename`, `Filename 2`, … up to the max duplicates found in that run.
- Screen Log. Live table shows `Filename` + `Duplicates found`. Columns are resizable and fill the available width.
- ETA and progress. ETA in `HH:MM:SS`. Progress bar fades from red → green as it completes.
- Hash cache. SQLite cache keyed by path/size/mtime; unchanged files are never re-hashed across runs.
- Simulation mode. “Simulation only” runs show what would be moved without touching the filesystem.
- Rollback. Generate a CSV of moves and restore later with Execute Rollback.
- Themes. Dark (pure black) and Light (pure white). Cyan outlines on Dark, blue on Light. Dropdowns open by clicking anywhere.
- Multithreaded hashing. Uses half of your logical cores by default (keeps the system responsive).
Tested on Windows 10/11 with Python 3.9+.
Global install (no virtualenv). If you prefer virtualenv/conda, it works the same.
```shell
python -m pip install --upgrade pip
python -m pip install PySide6 openpyxl pyinstaller
```

Run the app:

```shell
python "ragilmalik_duplicate_file_finder.py"
```

- Source Folder → Choose folder (defaults to the current working directory when launched).
- Log Folder Location → keep Default (Same folder as source folder) or pick a custom folder.
- Filter Mode → Include only or Exclude.
- File type box → leave `*` (all files) or set extensions like `.jpg, .png, .zip` (comma or semicolon separated).
- (Optional) Recursive → include subfolders.
- Click Run (or Simulation only to dry-run).
- Watch the Screen Log and ETA. Click Stop any time.
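The extension filter described above can be sketched like this (a minimal illustration, not the app’s actual code; `parse_extensions` and `matches` are hypothetical helper names):

```python
import re
from pathlib import Path
from typing import Optional, Set

def parse_extensions(filter_text: str) -> Optional[Set[str]]:
    """Parse the file-type box: '*' (or blank) means all files; otherwise
    split on commas/semicolons into a normalized set like {'.jpg', '.png'}."""
    text = filter_text.strip()
    if not text or text == "*":
        return None  # None means "no filtering"
    exts = set()
    for part in re.split(r"[,;]", text):
        part = part.strip().lower()
        if not part:
            continue
        if not part.startswith("."):
            part = "." + part  # allow "jpg" as shorthand for ".jpg"
        exts.add(part)
    return exts

def matches(path: Path, exts: Optional[Set[str]], include: bool = True) -> bool:
    """Apply Include-only / Exclude semantics to one file."""
    if exts is None:
        return True
    hit = path.suffix.lower() in exts
    return hit if include else not hit
```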
Results:
- Duplicates are moved to `Source\Duplicates\` (or just listed in Simulation).
- An Excel log is saved as `Duplicates_Log_REAL_*.xlsx` (or `_SIM_*.xlsx`).
```shell
pyinstaller --onefile --noconsole --clean --name "Ragilmalik_Duplicate_File_Finder" "ragilmalik_duplicate_file_finder.py"
```

The built EXE will be in `dist\Ragilmalik_Duplicate_File_Finder.exe`.
Tip: If you distribute the EXE, zip it together with a short README and a sample XLSX to reduce SmartScreen questions.
- Compute SHA‑256 for every file (multithreaded).
- Group files by hash.
- Pick the “original” in this order:
  - A filename without a trailing number wins (no `~(2)`, `-(3)`, `(4)`, `_5`, etc.).
  - Otherwise, the smallest suffix number wins.
  - Ties resolve lexicographically.
- Every other file in the group is a duplicate and gets moved into `Duplicates/` (unless Simulation is on).
The app never deletes files. It only moves them.
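The pick order above can be sketched in Python. This is an illustration, not the app’s actual code: the suffix regex is an assumption based on the patterns listed, and `pick_original` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Trailing duplicate suffixes like "photo (2).jpg", "photo-(3).jpg",
# "photo~(4).jpg", "photo_5.jpg" — assumed patterns from the rules above.
SUFFIX_RE = re.compile(r"^(?P<stem>.*?)[ ]?(?:[~-]?\((?P<n1>\d+)\)|_(?P<n2>\d+))$")

def suffix_number(path: Path):
    """Return the trailing duplicate number, or None if the name is 'clean'."""
    m = SUFFIX_RE.match(path.stem)
    if not m:
        return None
    return int(m.group("n1") or m.group("n2"))

def pick_original(paths):
    """Order one hash group: clean names first, then smallest suffix number,
    then lexicographic. The first element is the 'original'."""
    def key(p):
        n = suffix_number(p)
        return (n is not None, n if n is not None else 0, str(p).lower())
    ordered = sorted(paths, key=key)
    return ordered[0], ordered[1:]  # (original, duplicates)
```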
- Sheet: `Duplicates Log`
- Columns:
  - `Folder` (full path to the original’s parent)
  - `Date and Time (DD-MM-YYYY HH-MM-SS)`
  - `File Size` (kB/MB, base 1024)
  - `Filename` (original)
  - `Filename 2`, `Filename 3`, … one column per duplicate, up to the max found that run.
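Building that sheet layout can be sketched as follows (pure row-building only; `build_rows` and `human_size` are illustrative names, and the app’s actual formatting may differ — the rows are then written with openpyxl’s `ws.append()`):

```python
from datetime import datetime

def human_size(n):
    """Base-1024 size string (kB/MB), matching the File Size column."""
    for unit in ("B", "kB", "MB", "GB"):
        if n < 1024 or unit == "GB":
            return f"{n} {unit}" if unit == "B" else f"{n:.1f} {unit}"
        n /= 1024

def build_rows(groups, now=None):
    """groups: list of (folder, size_bytes, original_name, [dup_names]).
    Returns header + data rows; 'Filename N' columns pad to the widest group."""
    now = now or datetime.now()
    stamp = now.strftime("%d-%m-%Y %H-%M-%S")
    max_dups = max((len(dups) for *_, dups in groups), default=0)
    header = ["Folder", "Date and Time (DD-MM-YYYY HH-MM-SS)", "File Size", "Filename"]
    header += [f"Filename {i}" for i in range(2, max_dups + 2)]
    rows = [header]
    for folder, size, original, dups in groups:
        rows.append([folder, stamp, human_size(size), original, *dups])
    return rows

# Saving is then a few openpyxl calls:
#   wb = Workbook(); ws = wb.active; ws.title = "Duplicates Log"
#   for row in rows: ws.append(row)
#   wb.save("Duplicates_Log_SIM_example.xlsx")
```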
- Generate Rollback creates a CSV with columns `original_path`, `moved_path` for the last real run.
- Execute Rollback reads that CSV and moves files back. If the original name is taken, it’ll write `(restoredN)` next to the filename.
Keep rollback CSVs in the same folder as the log for easy bookkeeping.
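Executing a rollback like this could look as follows — a sketch under the assumptions above, with the `(restoredN)` collision naming taken from the description (the app’s exact format may differ):

```python
import csv
import shutil
from pathlib import Path

def execute_rollback(csv_path):
    """Read a rollback CSV (original_path, moved_path) and move files back.
    If the original name is taken, append ' (restoredN)' before the
    extension, bumping N until the name is free."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            src = Path(row["moved_path"])
            stem = Path(row["original_path"]).stem
            dst = Path(row["original_path"])
            n = 1
            while dst.exists():
                dst = dst.with_name(f"{stem} (restored{n}){dst.suffix}")
                n += 1
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
```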
- Threads: uses `(CPU logical cores) / 2` to keep the UI responsive.
- Hash cache: file path + size + mtime are cached; unchanged files are never re-hashed. Cache lives at `Source\.ragilmalik_hash_cache.sqlite3`.
- Disks: hashing is I/O bound. NVMe helps. Consider excluding huge media/archives when doing exploratory scans.
- Antivirus: large parallel reads can trigger AV scanning. If scans are slow, add a temporary AV exclusion for the source during testing.
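The cache-plus-threads behavior described above can be sketched like this (illustrative only — the real app’s SQLite schema, cache filename, and worker layout may differ):

```python
import hashlib
import os
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def sha256_file(path):
    """Stream the file in 1 MiB chunks so large files never load into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_all(paths, cache_db):
    """Hash files in parallel with half the logical cores, skipping any
    whose (size, mtime) still match the SQLite cache from a previous run."""
    con = sqlite3.connect(cache_db)
    con.execute("""CREATE TABLE IF NOT EXISTS hashes
                   (path TEXT PRIMARY KEY, size INTEGER, mtime REAL, sha256 TEXT)""")
    results, to_hash = {}, []
    for p in paths:
        st = os.stat(p)
        row = con.execute("SELECT size, mtime, sha256 FROM hashes WHERE path=?",
                          (str(p),)).fetchone()
        if row and row[0] == st.st_size and row[1] == st.st_mtime:
            results[str(p)] = row[2]          # cache hit: no re-hash
        else:
            to_hash.append((str(p), st))
    workers = max(1, (os.cpu_count() or 2) // 2)  # half the logical cores
    with ThreadPoolExecutor(max_workers=workers) as ex:
        digests = list(ex.map(sha256_file, (p for p, _ in to_hash)))
    # sqlite connections aren't thread-safe by default, so write back
    # from the main thread after the pool finishes.
    for (p, st), digest in zip(to_hash, digests):
        con.execute("INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
                    (p, st.st_size, st.st_mtime, digest))
        results[p] = digest
    con.commit()
    con.close()
    return results
```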
- Dark theme: black background, white text, cyan outlines.
- Light theme: white background, black text, blue outlines.
- Group frames: subtle 1 px grey outer border; controls keep the cyan/blue outline.
- Dropdowns: click anywhere on the box to open; Light theme uses a white arrow; popup list matches theme with readable text.
- Progress bar: text always centered; color transitions red → green.
- Screen Log: header gets a bottom border in the outline color; a vertical divider on the right of Filename. Rows/columns have a fine 1 px grid.
- Nothing happens on Run → Pick a Source Folder first.
- Missing dependency → `pip install PySide6 openpyxl`
- “Failed to execute script” after building → run the script directly to see the real traceback.
- GUI looks cramped → on high DPI, try Windows display scaling 125–150% or increase app font size in Windows settings.
- Very large folders → try Simulation only first, then narrow with extensions, then run for real.
The code is a single-file GUI built on PySide6 with a worker thread for hashing (plus a thread pool for parallel reads). The UI has a few quality-of-life touches (centered combo text, themed popup lists, header/column sizing helpers).
If you want to add a new export format or a “delete duplicates” action, wire it behind a confirmation step and keep the rollback feature intact.
Choose what fits your project. If you’re unsure, MIT is a good default.
Built for fast day‑to‑day duplicate cleanup with clear logs, safe moves, and easy rollback.