Handling API Timeouts in Batch OSM Routing
In spatial epidemiology and public health infrastructure planning, calculating drive-time accessibility across large patient cohorts requires deterministic, fault-tolerant network analysis. When origin-destination (OD) queries scale to tens of thousands of facility-patient pairs, API timeouts become a critical engineering constraint. Unhandled timeouts silently drop OD pairs, which introduces spatial sampling bias, compromises healthcare access equity metrics, and violates data integrity requirements for regulatory reporting. Production-grade pipelines must implement stateful retry architectures, topology-aware payload partitioning, and audit-compliant error tracking. This operational rigor aligns with established practices in Healthcare Access & Network Analysis Automation, where statistical validity depends on complete spatial coverage and reproducible query execution.
Root Cause Analysis in Routing Engines
OSM-derived routing engines (e.g., OSRM, Valhalla) execute graph traversals whose cost scales non-linearly with coordinate count, edge complexity, and turn restrictions. Timeouts typically arise from three sources: server-side query saturation, client-side serialization overhead, and transient network degradation. In public health workflows, unvalidated coordinate arrays and mismatched coordinate reference systems (CRS) are a frequent trigger: these engines expect WGS 84 longitude/latitude and do not reproject input, so projected or axis-swapped coordinates either fail to snap to the road network or snap far from the intended location, producing errors or implausibly long routes. Additionally, batch matrices that approach or exceed engine-specific limits can queue behind other requests, be rejected outright, or end in a hard 504 Gateway Timeout. Understanding these failure modes is essential for designing resilient Batch Routing & Error Handling architectures that preserve spatial accuracy under sustained computational load.
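One inexpensive guard against the CRS failure mode is a range check that runs before any request is sent. The sketch below is illustrative only: the helper name `find_invalid_lonlat` is hypothetical, and it assumes coordinates arrive as [lon, lat] pairs in decimal degrees.

```python
import math


def find_invalid_lonlat(coords):
    """Return indices of pairs that cannot be valid [lon, lat] in degrees."""
    bad = []
    for i, (lon, lat) in enumerate(coords):
        finite = math.isfinite(lon) and math.isfinite(lat)
        in_range = -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0
        if not (finite and in_range):
            bad.append(i)
    return bad


coords = [
    [-73.9857, 40.7484],    # plausible lon/lat
    [583960.0, 4507523.0],  # metre-scale values: probably a projected CRS
    [37.8, -122.4],         # latitude out of range: probably swapped axes
]
invalid = find_invalid_lonlat(coords)
```

A range check cannot catch every axis swap, since a swapped pair stays in range whenever the true longitude lies within ±90 degrees. Comparing points against the study area's bounding box is a stricter complement.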
Retry Architecture & Compliance Logging
Transient failures require stateful retry mechanisms rather than naive polling loops. Production systems should implement exponential backoff with randomized jitter to prevent thundering-herd effects on shared routing infrastructure. The retry strategy must strictly differentiate between recoverable HTTP status codes (429, 502, 504) and terminal failures (400, 404, invalid geometries). A circuit breaker pattern halts requests when consecutive failures exceed a defined threshold, preventing cascading pipeline degradation.
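The interaction between backoff, status classification, and the circuit breaker can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: `backoff_delay`, `CircuitBreaker`, and `call_with_retry` are hypothetical names, `send` is assumed to be a callable that returns an HTTP status code, and the "full jitter" variant of exponential backoff is one choice among several.

```python
import random
import time

RECOVERABLE = {429, 502, 504}


def backoff_delay(attempt, base=2.0, cap=30.0):
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; any success resets it."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.threshold

    def record(self, success):
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


def call_with_retry(send, breaker, max_attempts=5, sleep=time.sleep):
    """Retry recoverable statuses with backoff; return terminal ones at once."""
    status = None
    for attempt in range(max_attempts):
        if breaker.is_open:
            raise RuntimeError("circuit open: too many consecutive failures")
        status = send()
        if status == 200:
            breaker.record(success=True)
            return status
        if status not in RECOVERABLE:
            return status  # terminal client error: retrying cannot help
        breaker.record(success=False)
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    return status
```

Sharing one breaker across all chunks means a sustained outage stops the whole batch early instead of spending five attempts on every remaining chunk.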
Each attempt must log a deterministic payload hash, UTC timestamp, and spatial bounding box. Raw patient identifiers must never be persisted in retry logs, ensuring alignment with HIPAA minimum necessary standards and GDPR data minimization principles. The Python logging module should be configured with structured JSON formatters to enable downstream audit parsing and compliance verification.
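A structured formatter along these lines could emit one JSON object per attempt. It is a sketch using only the standard library; the class name `JsonAuditFormatter`, the helper `bounding_box`, and the field names are assumptions chosen for illustration, not an established schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonAuditFormatter(logging.Formatter):
    """Render each record as a single JSON line with a UTC timestamp."""

    def format(self, record):
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
            # Fields supplied through `extra=`; no patient identifiers.
            "payload_hash": getattr(record, "payload_hash", None),
            "attempt": getattr(record, "attempt", None),
            "bbox": getattr(record, "bbox", None),
        }
        return json.dumps(entry, sort_keys=True)


def bounding_box(coords):
    """Return [min_lon, min_lat, max_lon, max_lat] for [lon, lat] pairs."""
    lons = [c[0] for c in coords]
    lats = [c[1] for c in coords]
    return [min(lons), min(lats), max(lons), max(lats)]


handler = logging.StreamHandler()
handler.setFormatter(JsonAuditFormatter())
audit_logger = logging.getLogger("osm_batch_router.audit")
audit_logger.addHandler(handler)
audit_logger.setLevel(logging.INFO)
```

A call such as `audit_logger.warning("RECOVERABLE", extra={"payload_hash": h, "attempt": 2, "bbox": bounding_box(chunk)})` then produces one parseable line per attempt.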
A single OD chunk flows through the retry layer, which classifies failures and re-issues recoverable requests with backoff before returning a validated matrix:
sequenceDiagram
    participant P as Pipeline
    participant R as Retry layer
    participant E as Routing engine
    P->>R: submit OD chunk
    R->>E: Table API request (attempt 1)
    E-->>R: 504 Gateway Timeout
    Note over R: exponential backoff + jitter
    R->>E: Table API request (attempt 2)
    E-->>R: 200 OK, durations matrix
    R-->>P: validated travel-time matrix
Payload Optimization & CRS Alignment
Batch routing efficiency depends on strategic request partitioning. Monolithic coordinate matrices should be replaced with topology-aware chunks based on spatial proximity and network boundaries. Pre-processing must enforce consistent coordinate precision (typically six decimal places for ~0.11 m resolution at the equator) and validate geometries against the routing engine’s expected CRS (OSRM and most OSM-based engines expect EPSG:4326). Spatial indexing via geopandas enables efficient chunk generation that minimizes cross-boundary route fragmentation. Implementing these optimizations reduces payload serialization overhead and keeps individual requests within engine timeout thresholds, as documented in the OSRM Table API specifications.
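One simple way to approximate proximity-based chunking without a spatial library is to sort points along a Z-order (Morton) curve before slicing. The sketch below is an assumption-laden illustration rather than the approach any particular engine recommends: `morton_key` and `spatial_chunks` are hypothetical names, input is assumed to be [lon, lat] in EPSG:4326, and a Z-order curve reflects straight-line proximity only, not road-network topology.

```python
def morton_key(lon, lat, bits=16):
    """Interleave the bits of the scaled lon/lat to get a Z-order sort key."""
    scale = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * scale)
    y = int((lat + 90.0) / 180.0 * scale)
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key


def spatial_chunks(coords, chunk_size=25):
    """Round to six decimals, sort along the Z-order curve, then slice."""
    rounded = [[round(lon, 6), round(lat, 6)] for lon, lat in coords]
    ordered = sorted(rounded, key=lambda c: morton_key(c[0], c[1]))
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]


points = [
    [-122.4194, 37.7749],  # San Francisco area
    [2.3522, 48.8566],     # Paris area
    [-122.2711, 37.8044],  # San Francisco area
    [2.2945, 48.8584],     # Paris area
]
chunks = spatial_chunks(points, chunk_size=2)
```

With the interleaved input above, each chunk ends up holding points from a single metropolitan area. Z-order curves have occasional jumps between neighbouring cells, so chunks near those seams may still span a wider area.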
Production-Ready Python Implementation
The following pipeline demonstrates deterministic retry logic, spatial validation, and audit-compliant logging using tenacity for backoff management and requests for HTTP execution.
import hashlib
import json
import logging
import time

import geopandas as gpd
import requests
from requests.exceptions import ConnectionError, Timeout
from tenacity import (
    retry, retry_if_exception_type, retry_if_result,
    stop_after_attempt, wait_exponential, wait_random,
)

# Configure audit-ready logging; asctime is emitted in UTC
logging.Formatter.converter = time.gmtime
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s | %(levelname)s | %(message)s',
    handlers=[logging.FileHandler("routing_audit.log")]
)
logger = logging.getLogger("osm_batch_router")


def generate_payload_hash(coords):
    """Deterministic hash for audit trails without storing raw coordinates."""
    return hashlib.sha256(json.dumps(coords, sort_keys=True).encode()).hexdigest()[:12]


def _is_recoverable(response):
    """Return True if response indicates a recoverable failure for tenacity to retry."""
    return hasattr(response, 'status_code') and response.status_code in {429, 502, 504}


@retry(
    retry=(retry_if_exception_type((Timeout, ConnectionError)) | retry_if_result(_is_recoverable)),
    # Exponential backoff plus random jitter to avoid synchronized retries
    wait=wait_exponential(multiplier=1, min=2, max=30) + wait_random(0, 2),
    stop=stop_after_attempt(5),
    reraise=True
)
def fetch_route_batch(engine_url, coords_chunk, timeout=15):
    """
    Request a duration matrix for one coordinate chunk from the OSRM Table API.

    OSRM's HTTP API is GET-based with coordinates in the URL path, so engine_url
    is assumed to be the table service prefix, e.g.
    "http://localhost:5000/table/v1/driving".
    coords_chunk is a list of [lon, lat] pairs used as both sources and
    destinations (the default when neither option is passed).
    If recoverable responses persist for all attempts, tenacity raises RetryError.
    """
    coord_path = ";".join(f"{lon},{lat}" for lon, lat in coords_chunk)
    payload_hash = generate_payload_hash(coords_chunk)
    try:
        resp = requests.get(f"{engine_url.rstrip('/')}/{coord_path}", timeout=timeout)
        if resp.status_code == 200:
            logger.info(f"SUCCESS | hash={payload_hash} | duration={resp.elapsed.total_seconds():.2f}s")
            return resp.json()
        elif _is_recoverable(resp):
            logger.warning(f"RECOVERABLE | hash={payload_hash} | status={resp.status_code}")
            return resp  # tenacity will trigger retry via retry_if_result
        else:
            logger.error(f"TERMINAL | hash={payload_hash} | status={resp.status_code} | body={resp.text[:100]}")
            resp.raise_for_status()
            # raise_for_status() is silent below 400; never return None
            raise RuntimeError(f"Unexpected status {resp.status_code}")
    except (Timeout, ConnectionError) as e:
        logger.warning(f"NETWORK_ERROR | hash={payload_hash} | detail={str(e)}")
        raise


def chunk_od_pairs(gdf_origins, gdf_destinations, chunk_size=25):
    """
    Split the combined origin and destination coordinates into fixed-size chunks.

    Chunks follow input order, so they are spatially compact only if the inputs
    are already sorted by location. Each chunk yields an all-to-all matrix
    among its own points.
    """
    # OSRM expects EPSG:4326 (longitude, latitude order)
    gdf_origins = gdf_origins.to_crs("EPSG:4326")
    gdf_destinations = gdf_destinations.to_crs("EPSG:4326")
    # Round to 6 decimals (~0.11 m) for deterministic payloads
    origins_coords = [
        [round(geom.x, 6), round(geom.y, 6)] for geom in gdf_origins.geometry
    ]
    destinations_coords = [
        [round(geom.x, 6), round(geom.y, 6)] for geom in gdf_destinations.geometry
    ]
    # Combine into a single coordinate list for the Table API
    all_coords = origins_coords + destinations_coords
    return [all_coords[i:i + chunk_size] for i in range(0, len(all_coords), chunk_size)]


def run_batch_routing(engine_url, origins_path, destinations_path):
    """Route every chunk; failed chunks are recorded as None, never dropped."""
    origins = gpd.read_file(origins_path)
    destinations = gpd.read_file(destinations_path)
    chunks = chunk_od_pairs(origins, destinations)
    results = []
    for i, chunk in enumerate(chunks):
        try:
            results.append(fetch_route_batch(engine_url, chunk))
        except Exception as e:
            logger.critical(f"CHUNK_{i}_FAILED | detail={str(e)}")
            # Keep results aligned with chunks; None flags the chunk for
            # manual review or a secondary engine
            results.append(None)
    return results
Spatial Validation & Audit Readiness
Post-processing must verify spatial completeness before calculating accessibility indices. Missing routes should be explicitly flagged rather than imputed to prevent bias in spatial equity metrics. Implement validation checks that verify returned travel times fall within physically plausible ranges (e.g., >0 and <24 hours) and cross-reference route geometries against known facility catchments.
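The range check can be expressed as a small function over a returned durations matrix. This sketch makes several assumptions that should be verified against the engine in use: durations are in seconds, unroutable pairs arrive as null (None after JSON decoding), and the matrix is all-to-all over one coordinate list, so the diagonal holds self-pairs. The name `flag_durations` is hypothetical.

```python
MAX_SECONDS = 24 * 3600  # upper plausibility bound from the text: 24 hours


def flag_durations(durations):
    """Return (row, col, reason) for each off-diagonal cell that fails checks."""
    flags = []
    for i, row in enumerate(durations):
        for j, value in enumerate(row):
            if i == j:
                continue  # self-pair: zero is expected here
            if value is None:
                flags.append((i, j, "missing"))
            elif not (0 < value < MAX_SECONDS):
                flags.append((i, j, "out_of_range"))
    return flags


durations = [
    [0, 612.4, None],
    [598.1, 0, 90000.0],
    [None, 0.0, 0],
]
flags = flag_durations(durations)
```

Flagged cells stay in the output as explicit gaps, so downstream accessibility indices can report coverage alongside the metric itself.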
Audit trails must support full reproducibility for public health reporting and peer review. Store chunk hashes, retry counts, and final status codes in a version-controlled metadata table. This pattern supports compliance with federal spatial data standards while maintaining the statistical integrity required for epidemiological modeling and resource allocation decisions.
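One lightweight form for such a metadata table is a CSV file committed alongside the analysis code. The sketch below assumes that layout; the `ChunkAudit` record and `write_audit_table` helper are hypothetical names, and the three columns are exactly the fields listed above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class ChunkAudit:
    chunk_hash: str
    retry_count: int
    final_status: int


def write_audit_table(records, handle):
    """Write records as CSV, sorted by hash so diffs stay stable across runs."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(ChunkAudit)])
    writer.writeheader()
    for rec in sorted(records, key=lambda r: r.chunk_hash):
        writer.writerow(asdict(rec))


buffer = io.StringIO()
write_audit_table(
    [ChunkAudit("f3a91c0d2b7e", 2, 200), ChunkAudit("0b44e1aa97c3", 4, 504)],
    buffer,
)
rows = buffer.getvalue().splitlines()
```

Sorting by hash rather than by execution order keeps the file identical when the same chunks are processed in a different sequence, which makes version-control diffs reflect real changes only.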