A detailed, human-readable breakdown of how the postcode lookup script works, including the mathematics behind the distance calculations.
The script takes ~326,000 London postcodes and ~34 neonatal hospitals, and for each postcode finds the nearest hospital at each care level (1, 2, 3, and any). It does this in three stages:
- Classify each postcode as North or South of the Thames
- Search for the nearest hospital using a spatial data structure (KD-tree)
- Measure the actual distance using the Haversine formula
A UK postcode like SW1A 2AA has two parts:
- Outcode (first half):
SW1A— identifies the area and district - Incode (second half):
2AA— narrows it to a few addresses
The outcode itself has an area code (the leading letters) and a district number:
SW1A
││ │
││ └─ District sub-letter (not all outcodes have this)
│└── District number
└── Area code
The script reads characters from the start of the outcode until it hits a digit:
WC2R → WC (stops at '2')
EC1A → EC (stops at '1')
SW10 → SW (stops at '1')
N1C → N (stops at '1')
BR1 → BR (stops at '1')
This is important because a naïve approach of stripping all digits (e.g. WC2R → WCR) would fail to match against the known prefix WC.
The script uses a lookup set of area codes known to be north of the Thames:
N, NW, E, EN, HA, IG, RM, UB, WD, WC, EC, W
If the area code is in this set → North. Otherwise → South.
Special case — SW postcodes: The SW area straddles the river. Some districts (SW1, SW3, SW5, SW6, SW7, SW10) are north of the Thames (Chelsea, Knightsbridge, Earl's Court), while others (SW2, SW4, SW8, SW9, SW11–SW20) are south (Brixton, Clapham, Battersea). The script extracts the district number from the outcode and checks it against a known set of north-side districts.
Even within north-side SW districts, a small number of individual postcodes are geographically south of the Thames or are better served by South London hospitals. These are hardcoded as exceptions and always classified as South, regardless of their district:
| Postcode | District | Reason for exception |
|---|---|---|
| SW1W 9FJ | SW1 | South of the Thames |
| SW1P 9UP | SW1 | South of the Thames |
| SW1Y 6YQ | SW1 | South of the Thames |
| SW1Y 5WT | SW1 | South of the Thames |
| SW1Y 4ZB | SW1 | South of the Thames |
| SW1Y 6WZ | SW1 | South of the Thames |
| SW1Y 5ZP | SW1 | South of the Thames |
| SW1X 9UQ | SW1 | South of the Thames |
| SW3 9DU | SW3 | South of the Thames |
| SW3 9EG | SW3 | South of the Thames |
| SW3 9GG | SW3 | South of the Thames |
In the code, these postcodes are stored without spaces (e.g. SW1W9FJ) because the comparison strips spaces from the input postcode before checking the set. The table above shows them in standard format for readability.
These exceptions are checked before the district-level classification, so they take priority.
Given a postcode at coordinates (lat, lon), find the closest hospital from a filtered subset. Doing this with brute force — calculating the distance to every hospital for every postcode — would mean 326,000 × 34 = ~11 million distance calculations, repeated for each care level.
A KD-tree (k-dimensional tree) is a data structure that organises points in space so you can quickly find the nearest one without checking every point.
Think of it like a series of binary decisions:
- Split all hospitals into two groups: those with latitude above the median and those below
- Within each group, split by longitude
- Keep splitting alternately by latitude and longitude
This creates a tree where nearby points end up in the same branches. To find the nearest hospital to a postcode, you descend the tree, only exploring branches that could contain a closer point than the best found so far. This reduces the search from O(n) to approximately O(log n).
KD-trees use Euclidean distance — they treat the coordinate axes as a flat grid. But the Earth is a sphere, and lines of longitude converge towards the poles. At the equator, 1° of longitude spans ~111 km. At London's latitude (51.5°N), it only spans:
Meanwhile, 1° of latitude is always ~111 km regardless of where you are.
If we feed raw (lat, lon) into the KD-tree, it would think a hospital 1° east is the same distance as one 1° north — but in reality the eastward one is ~40% closer. This would give wrong nearest-neighbour results.
The fix: Before building the KD-tree, we scale longitude by
This makes the coordinate space approximately proportional to real-world distances, so the KD-tree finds the genuinely nearest hospital. We only use this scaling for the search — actual distances are calculated separately using the Haversine formula.
Instead of looping through 326,000 postcodes one at a time, the script:
- Groups all postcodes by side (North/South)
- For each side × level combination, builds one KD-tree
- Queries it with all postcodes in that group at once
This means only 8 KD-tree builds (2 sides × 4 levels) instead of ~1.3 million, reducing runtime from potentially hours to seconds.
On a flat surface, the distance between two points is:
But the Earth is a sphere (approximately). Over short distances in London this error is small, but the Haversine formula gives us accurate results without any flat-Earth approximation.
Given two points with latitude and longitude in radians:
- Point 1:
$(\phi_1, \lambda_1)$ - Point 2:
$(\phi_2, \lambda_2)$
First, compute the intermediate value
Then the distance is:
Where
The Haversine formula computes the great-circle distance — the shortest path between two points along the surface of a sphere. Imagine stretching a string taut between two points on a globe; that's the great-circle distance.
The value
-
$\sin^2!\left(\frac{\Delta\phi}{2}\right)$ — how far apart the points are in the north-south direction -
$\cos(\phi_1) \cdot \cos(\phi_2) \cdot \sin^2!\left(\frac{\Delta\lambda}{2}\right)$ — how far apart they are east-west, adjusted for latitude (the$\cos$ terms shrink this component near the poles, where longitude lines converge)
The
For postcode BR1 1AA (51.4015°N, 0.0154°E) and Princess Royal hospital (51.3856°N, 0.0455°E):
- Convert to radians:
$\phi_1 = 0.8974$ ,$\lambda_1 = 0.000269$ ,$\phi_2 = 0.8971$ ,$\lambda_2 = 0.000794$ -
$\Delta\phi = -0.000278$ ,$\Delta\lambda = 0.000525$ $a = \sin^2(-0.000139) + \cos(0.8974) \cdot \cos(0.8971) \cdot \sin^2(0.000263)$ $a \approx 1.93 \times 10^{-8} + 0.3876 \times 6.89 \times 10^{-8} \approx 4.60 \times 10^{-8}$ -
$d = 2 \times 6371 \times \arcsin(\sqrt{4.60 \times 10^{-8}}) \approx 2.73$ km
postcodes_master.csv ──→ Classify N/S ──→ Group by side ──┐
├──→ KD-tree query ──→ Haversine distance ──→ Output CSV
hospitals_refined.csv ──→ Filter by side & level ──────────┘
| Step | Method | Purpose |
|---|---|---|
| Classification | Prefix parsing | Assign postcodes to correct side of the Thames |
| Nearest search | KD-tree with scaled coordinates | Efficiently find the closest hospital |
| Distance | Haversine formula | Accurately measure real-world distance in km |