URL normalization library for creating consistent URL representations.
The URL normalization process creates a mechanism to provide equivalence between URLs with varying string, protocol, scheme, and query parameter ordering. This library helps create normalized representations of URLs for consistent storage, comparison, and analysis.
pip install tk-normalizerfrom tk_normalizer import TkNormalizer
# Simple usage - str() returns just the normalized URL
normalized = TkNormalizer("http://www.Example.com/path?b=2&a=1&utm_source=test")
print(str(normalized)) # Output: example.com/path?a=1&b=2
# Get full details with dict()
print(dict(normalized)) # Returns all fields including query_string, path, and hashesThe following URLs all normalize to the same normalized form:
https://example.com/
http://www.example.com/
http://www.example.com
http://www.example.com/#my_search_engine_is_great
https://www.example.com/?utm_campaign=SomeGoogleCampaign
https://www.example.com/?utm_source=because&utm_campaign=SomeGoogleCampaign
All normalize to: example.com
URLs are normalized through the following steps:
- ✅ Protocol and www subdomains removed
- ✅ Lowercased
- ✅ Trailing slashes removed
- ✅ Query parameters reordered alphabetically by key
- ✅ Duplicate query parameter key/value pairs removed
- ✅ Common tracking parameters removed (utm_*, gclid, fbclid, etc.)
- ✅ Non-HTTP(S) protocols rejected
- ✅ Localhost URLs rejected
The following tracking parameters are automatically removed during normalization:
utm_*(all utm parameters)gclid,fbclid,dclid(click identifiers)_ga,_gid,_fbp,_hjid(analytics cookies)msclkid(Microsoft Ads)aff_id,affid(affiliate tracking)referrer,adgroupid,srsltid
from tk_normalizer import TkNormalizer
normalizer = TkNormalizer("http://blog.example.com/page?b=2&a=1")
# Use str() for just the normalized URL
print(str(normalizer)) # blog.example.com/page?a=1&b=2
# Use dict() for complete normalization data
result = dict(normalizer)
print(result)
# {
# 'normalized_url': 'blog.example.com/page?a=1&b=2',
# 'parent_normalized_url': 'blog.example.com',
# 'root_normalized_url': 'example.com',
# 'query_string': 'a=1&b=2',
# 'path': '/page',
# 'normalized_url_hash': '...',
# 'parent_normalized_url_hash': '...',
# 'root_normalized_url_hash': '...'
# }from tk_normalizer import TkNormalizer, InvalidUrlException
try:
normalizer = TkNormalizer("not a valid url")
except InvalidUrlException as e:
print(f"Invalid URL: {e}")from tk_normalizer import TkNormalizer
normalizer = TkNormalizer("https://blog.example.com/path?a=1")
# Dict-like access to individual fields
print(normalizer["normalized_url"]) # blog.example.com/path?a=1
print(normalizer["parent_normalized_url"]) # blog.example.com
print(normalizer["root_normalized_url"]) # example.com
print(normalizer["query_string"]) # a=1
print(normalizer["path"]) # /path
# Iterate over available fields
for key in normalizer:
print(f"{key}: {normalizer[key]}")
# Get all field names
print(normalizer.keys())For efficient storage and comparison, SHA-256 hashes are computed for:
- The normalized URL
- The parent normal URL (domain without path)
- The root normal URL (root domain without subdomains)
This provides fixed-length representations suitable for database indexing.
While this normalization process works well for most use cases, there are some limitations:
-
www subdomain removal: Technically,
www.example.comandexample.comcould serve different content, though this is rare in practice. -
Case sensitivity: URLs are lowercased, but some servers are case-sensitive for paths.
-
Tracking parameters: New tracking parameters emerge over time and may not be in the removal list.
-
Fragment removal: URL fragments (#anchors) are removed, which may affect single-page applications.
# Clone the repository
git clone https://github.com/terakeet/tk-normalizer.git
cd tk-normalizer
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=tk_normalizer
# Run linting
ruff check src tests# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test file
pytest tests/test_normalizer.py
# Run with coverage report
pytest --cov=tk_normalizer --cov-report=htmlContributions are welcome! Please feel free to submit a Pull Request.
We have a workflow set up to deploy to our PYPI package when a release is created. Here is how you can do that:
- Cut a PR for your change
- Make sure you increment your version number in the pyproject.toml file in your changes
- After approval, merge changes to main branch
- Cut a new release in GitHub
- this can be found on the right hand side of the screen when you are at the repo's home page
- you should see the current release
- For consistency in the release:
- create a new tag that matches the version number you changed earlier
- add a title with a brief description of the changes
- add a small description or link to JIRA tickets for updates
- After creating the release you should see a workflow get triggered, this will deploy the updated version to pypi
- If you want to see check the pypi package page after the workflow completes running
NOTE: DO NOT change the name of the workflow file. If you do the deployment will not work unless we update the configuration in PYPI under trusted publishers
If you have questions or concerns reach out.
The normalizer implementaiton in snowflake is defined at TERAKEET.COMMON.NORMALIZE_URL. Once PYPI has been deployed, another workflow will run and update the UDF in Snowflake. That UDF needs the new version of PYPI and will just override the current function. This happens instantly after PYPI has been deployed.
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions, please use the GitHub issue tracker.
Based on the URL normalization functionality from tk-core, extracted and packaged for standalone use.