diffhouse is a Python solution for structuring Git metadata, designed to enable large-scale codebase analysis at practical speeds.
Key features are:
- 🚀 Fast access to commit data, file changes and more
- 📊 Easy integration with pandas and Polars
- 🐍 Simple-to-use Python interface
Processing times for tween.js. Lower is better.
For more details, see benchmarks.
| Python | 3.10 or higher |
| Git | 2.22 or higher |
Git also needs to be added to the system PATH.
At its core, diffhouse is a data extraction tool and therefore does not calculate software metrics like code churn or cyclomatic complexity; if this is needed, take a look at PyDriller instead.
This guide aims to cover the basic use cases of diffhouse. For a full list of objects, consider reading the API Reference.
Install diffhouse from PyPI:
pip install diffhouseIf you plan to combine diffhouse with pandas or Polars, install the package with their respective extras:
| pandas | pip install diffhouse[pandas] |
| Polars | pip install diffhouse[polars] |
from diffhouse import Repo
with Repo('https://github.com/user/repo') as r:
for c in r.commits:
print(c.commit_hash[:10], c.date, c.author_email)
if len(r.branches.to_list()) > 100:
print('🎉')
df = r.diffs.to_pandas()To start, create a Repo instance by passing either a Git-hosting URL or a local path as its source argument. Next, use the Repo in a with statement to clone the source into a local, non-persistent
location.
Inside the with block, you can access data through the following properties:
| Property | Description | Record Type |
|---|---|---|
Repo.commits |
Commit history of the repository. | Commit |
Repo.filemods |
File modifications across the commit history. | FileMod |
Repo.diffs |
Source code changes across the commit history. | Diff |
Repo.branches |
Branches of the repository. | Branch |
Repo.tags |
Tags of the repository. | Tag |
Data accessors like Repo.commits are Extractor objects and can output their results in various formats:
You can use extractors in a for loop to process objects one by one. Data will be extracted on demand for memory efficiency:
with Repo('https://github.com/user/repo') as r:
for c in r.commits:
print(c.commit_hash[:10])
print(c.author_name)
if c.in_main:
breakiter_dicts() is a for loop alternative that yields dictionaries instead of diffhouse objects. A good use case for this is writing results into a newline-delimited JSON file:
import json
with (
Repo('https://github.com/user/repo') as r,
open('commits.jsonl', 'w') as f
):
for c in r.commits.iter_dicts():
f.write(json.dumps(c) + '\n')pandas and Polars DataFrame APIs are supported out of the box. To convert result sets to dataframes, call the following methods:
to_pandas()orpd()for pandasto_polars()orpl()for Polars
with Repo('https://github.com/user/repo') as r:
df1 = r.filemods.to_pandas() # pandas
df2 = r.diffs.to_polars() # PolarsYou can filter data along certain dimensions before processing takes place to reduce extraction time and/or network load.
Note
Filters are a WIP feature. Additional options like date and branch filtering are planned for future releases.
If no blob-level data is needed, pass blobs=False when creating the Repo to skip file downloads during cloning. Note that this will not populate:
files_changed,lines_addedandlines_deletedfields ofRepo.commitsRepo.filemodsRepo.diffs
with Repo('https://github.com/user/repo', blobs=False) as r:
for b in r.branches:
pass # business as usual
r.filemods # throws FilterError