tk-normalizer

URL normalization library for creating consistent URL representations.

Purpose

URL normalization provides a way to treat URLs as equivalent when they differ only in letter case, protocol (scheme), www prefix, tracking parameters, or query parameter ordering. This library creates normalized representations of URLs for consistent storage, comparison, and analysis.

Installation

pip install tk-normalizer

Quick Start

from tk_normalizer import TkNormalizer

# Simple usage - str() returns just the normalized URL
normalized = TkNormalizer("http://www.Example.com/path?b=2&a=1&utm_source=test")
print(str(normalized))  # Output: example.com/path?a=1&b=2

# Get full details with dict()
print(dict(normalized))  # Returns all fields including query_string, path, and hashes

Features

URL Normalization

The following URLs all normalize to the same form:

https://example.com/
http://www.example.com/
http://www.example.com
http://www.example.com/#my_search_engine_is_great
https://www.example.com/?utm_campaign=SomeGoogleCampaign
https://www.example.com/?utm_source=because&utm_campaign=SomeGoogleCampaign

All normalize to: example.com

Normalization Process

URLs are normalized through the following steps:

  • ✅ Protocol and www subdomains removed
  • ✅ Lowercased
  • ✅ Trailing slashes removed
  • ✅ Query parameters reordered alphabetically by key
  • ✅ Duplicate query parameter key/value pairs removed
  • ✅ Common tracking parameters removed (utm_*, gclid, fbclid, etc.)
  • ✅ Non-HTTP(S) protocols rejected
  • ✅ Localhost URLs rejected
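
The steps above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: the function name sketch_normalize is hypothetical, lowercasing the whole URL and ignoring ports are assumptions, and tracking-parameter removal is left out for brevity.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode


def sketch_normalize(url: str) -> str:
    """Illustrative only: approximates the listed steps, not TkNormalizer itself."""
    # Assumption: the whole URL is lowercased, including path and query.
    parts = urlsplit(url.strip().lower())
    if parts.scheme not in ("http", "https"):
        # Non-HTTP(S) protocols are rejected.
        raise ValueError(f"unsupported protocol: {url!r}")
    # Assumption: port and credentials are ignored in this sketch.
    host = parts.hostname or ""
    if host in ("", "localhost"):
        # Localhost (or a missing host) is rejected.
        raise ValueError(f"unsupported host: {url!r}")
    if host.startswith("www."):
        host = host[4:]  # drop the www subdomain
    path = parts.path.rstrip("/")  # drop trailing slashes
    # Sort pairs by key (then value) and drop duplicate key/value pairs.
    pairs = sorted(set(parse_qsl(parts.query, keep_blank_values=True)))
    query = urlencode(pairs)
    # The fragment (parts.fragment) is simply not carried over.
    return host + path + (f"?{query}" if query else "")
```

With this sketch, sketch_normalize("http://www.Example.com/path?b=2&a=1&a=1") returns example.com/path?a=1&b=2, and both https://example.com/ and http://www.example.com/#frag reduce to example.com.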

Tracking Parameters Removed

The following tracking parameters are automatically removed during normalization:

  • utm_* (all utm parameters)
  • gclid, fbclid, dclid (click identifiers)
  • _ga, _gid, _fbp, _hjid (analytics cookies)
  • msclkid (Microsoft Ads)
  • aff_id, affid (affiliate tracking)
  • referrer, adgroupid, srsltid
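
As a rough illustration of how such a filter might work, the sketch below drops the parameters named in the list above from a query string. The names TRACKING_KEYS, is_tracking_param, and strip_tracking are hypothetical, and the library's actual list and matching rules may differ (the "etc." in the feature list suggests it is longer).

```python
from urllib.parse import parse_qsl, urlencode

# Keys copied from the list above; the library's real list may be longer.
TRACKING_KEYS = {
    "gclid", "fbclid", "dclid",          # click identifiers
    "_ga", "_gid", "_fbp", "_hjid",      # analytics cookies
    "msclkid",                           # Microsoft Ads
    "aff_id", "affid",                   # affiliate tracking
    "referrer", "adgroupid", "srsltid",
}


def is_tracking_param(key: str) -> bool:
    """Return True for utm_* keys and for any key in TRACKING_KEYS."""
    key = key.lower()
    return key.startswith("utm_") or key in TRACKING_KEYS


def strip_tracking(query: str) -> str:
    """Return the query string with tracking parameters removed, order preserved."""
    kept = [
        (key, value)
        for key, value in parse_qsl(query, keep_blank_values=True)
        if not is_tracking_param(key)
    ]
    return urlencode(kept)
```

For example, strip_tracking("a=1&utm_source=test&gclid=abc&b=2") returns a=1&b=2.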

Advanced Usage

Getting Full Normalization Details

from tk_normalizer import TkNormalizer

normalizer = TkNormalizer("http://blog.example.com/page?b=2&a=1")

# Use str() for just the normalized URL
print(str(normalizer))  # blog.example.com/page?a=1&b=2

# Use dict() for complete normalization data
result = dict(normalizer)
print(result)
# {
#   'normalized_url': 'blog.example.com/page?a=1&b=2',
#   'parent_normalized_url': 'blog.example.com',
#   'root_normalized_url': 'example.com',
#   'query_string': 'a=1&b=2',
#   'path': '/page',
#   'normalized_url_hash': '...',
#   'parent_normalized_url_hash': '...',
#   'root_normalized_url_hash': '...'
# }
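
The relationship between the three URL fields in the output above can be pictured with a naive sketch. This is an assumption about the intent, not the library's code: the function name sketch_parent_and_root is hypothetical, and taking the last two host labels as the root domain is a simplification that breaks for multi-part suffixes such as co.uk, where a public suffix list would be needed.

```python
def sketch_parent_and_root(normalized_url: str) -> tuple:
    """Derive (parent, root) from an already-normalized URL string (naive)."""
    # Parent: everything before the first "?" or "/", i.e. the host alone.
    host = normalized_url.split("?", 1)[0].split("/", 1)[0]
    # Root: the last two dot-separated labels (naive assumption, see lead-in).
    root = ".".join(host.split(".")[-2:])
    return host, root


parent, root = sketch_parent_and_root("blog.example.com/page?a=1&b=2")
print(parent)  # blog.example.com
print(root)    # example.com
```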

Error Handling

from tk_normalizer import TkNormalizer, InvalidUrlException

try:
    normalizer = TkNormalizer("not a valid url")
except InvalidUrlException as e:
    print(f"Invalid URL: {e}")

Accessing Individual Components

from tk_normalizer import TkNormalizer

normalizer = TkNormalizer("https://blog.example.com/path?a=1")

# Dict-like access to individual fields
print(normalizer["normalized_url"])       # blog.example.com/path?a=1
print(normalizer["parent_normalized_url"]) # blog.example.com
print(normalizer["root_normalized_url"])   # example.com
print(normalizer["query_string"])          # a=1
print(normalizer["path"])                  # /path

# Iterate over available fields
for key in normalizer:
    print(f"{key}: {normalizer[key]}")

# Get all field names
print(normalizer.keys())

Hashing

For efficient storage and comparison, SHA-256 hashes are computed for:

  • The normalized URL
  • The parent normalized URL (domain without path)
  • The root normalized URL (root domain without subdomains)

This provides fixed-length representations suitable for database indexing.
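
To show why the result is fixed-length, here is a minimal sketch using Python's hashlib. The helper name sha256_hex is hypothetical, and the UTF-8 encoding and hexadecimal output are assumptions; the library may encode or format its hashes differently.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a string as 64 hexadecimal characters."""
    # Assumption: the string is hashed as UTF-8 bytes.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


short_hash = sha256_hex("example.com")
long_hash = sha256_hex("blog.example.com/page?a=1&b=2")
print(len(short_hash), len(long_hash))  # 64 64
```

Because every digest has the same length regardless of the URL's length, a hash column can be indexed with a fixed-width type.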

Important Caveats

While this normalization process works well for most use cases, there are some limitations:

  1. www subdomain removal: Technically, www.example.com and example.com could serve different content, though this is rare in practice.

  2. Case sensitivity: URLs are lowercased, but some servers are case-sensitive for paths.

  3. Tracking parameters: New tracking parameters emerge over time and may not be in the removal list.

  4. Fragment removal: URL fragments (#anchors) are removed, which may affect single-page applications.

Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/terakeet/tk-normalizer.git
cd tk-normalizer

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=tk_normalizer

# Run linting
ruff check src tests

Running Tests

# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test file
pytest tests/test_normalizer.py

# Run with coverage report
pytest --cov=tk_normalizer --cov-report=html

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Deploying to PyPI

A workflow deploys the package to PyPI whenever a release is created. To publish a new version:

  1. Open a PR for your change.
  2. Increment the version number in pyproject.toml as part of your changes.
  3. After approval, merge the changes to the main branch.
  4. Create a new release in GitHub.
    • The Releases section is on the right-hand side of the repository's home page.
    • It shows the current release.
  5. For consistency in the release:
    • create a new tag that matches the version number you changed earlier
    • add a title with a brief description of the changes
    • add a short description or links to JIRA tickets for the updates
  6. Creating the release triggers a workflow that deploys the updated version to PyPI.
  7. To verify the deployment, check the PyPI package page after the workflow finishes.

NOTE: DO NOT change the name of the workflow file. If you do, the deployment will not work until the trusted publisher configuration in PyPI is updated.

If you have questions or concerns, reach out.

Deploying to Snowflake -- UDF Update

The normalizer implementation in Snowflake is defined at TERAKEET.COMMON.NORMALIZE_URL. Once the package has been deployed to PyPI, another workflow runs and updates the UDF in Snowflake. The UDF needs the new package version from PyPI, and the update overwrites the current function. This happens immediately after the PyPI deployment.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For issues and questions, please use the GitHub issue tracker.

Credits

Based on the URL normalization functionality from tk-core, extracted and packaged for standalone use.
