Configuring SaTScan for Retrospective Cluster Detection

Retrospective cluster detection in public health surveillance requires deterministic execution, strict parameterization, and auditable data pipelines. SaTScan's retrospective scan statistics evaluate historical case distributions against expected baselines to identify statistically significant spatial or spatiotemporal aggregations. Production deployments in government and agency environments must eliminate boundary artifacts, enforce coordinate reference system (CRS) alignment, and complete the configured Monte Carlo replications within defined compute budgets. This guide details the configuration workflow for agency-grade deployments, emphasizing compliance, spatial validation, and automated .prm generation.

Data Schema Alignment and Coordinate Standardization

SaTScan reads plain-text input files with space- or tab-delimited columns in a defined order. Case files use the format <location_id> <cases> <date> (one record per location-date pair). Population files use <location_id> <year> <population>. Coordinate files use <location_id> <latitude> <longitude>, with latitude first. Unprojected WGS84 coordinates in decimal degrees are acceptable for SaTScan's native geographic distance calculations; SaTScan handles the spherical geometry internally. If you pre-compute projected coordinates for distance validation in Python, reproject back to geographic coordinates (EPSG:4326) before writing the .geo file.
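As a minimal sketch of the coordinate file layout described above, the helper below writes one "<location_id> <latitude> <longitude>" row per location and rejects values that cannot be WGS84 degrees. The function name write_geo_file and the six-decimal formatting are illustrative assumptions, not part of SaTScan.

```python
from pathlib import Path


def write_geo_file(centroids, geo_path):
    """Write '<location_id> <latitude> <longitude>' rows, one per location.

    `centroids` is an iterable of (location_id, latitude, longitude) tuples
    in decimal degrees. This helper is hypothetical, not a SaTScan API.
    """
    lines = []
    seen = set()
    for loc_id, lat, lon in centroids:
        loc_id = str(loc_id)
        # Columns are whitespace-delimited, so IDs must not contain whitespace.
        if not loc_id or any(ch.isspace() for ch in loc_id):
            raise ValueError(f"Invalid location ID: {loc_id!r}")
        if loc_id in seen:
            raise ValueError(f"Duplicate location ID: {loc_id}")
        # A range check catches many swapped latitude/longitude pairs.
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            raise ValueError(f"{loc_id}: ({lat}, {lon}) is not a valid lat/lon pair")
        seen.add(loc_id)
        lines.append(f"{loc_id} {lat:.6f} {lon:.6f}")
    Path(geo_path).write_text("\n".join(lines) + "\n")
    return len(lines)
```

The range check only catches swapped axes when the longitude magnitude exceeds 90 degrees, so it complements rather than replaces a visual or spatial-join check.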

Privacy frameworks such as HIPAA and GDPR restrict the handling of identifiable location data, so raw addresses should never be ingested into scan engines. Geocoding must occur in isolated environments, with outputs aggregated to census tracts, hexagonal grids, or jittered centroids that satisfy k-anonymity thresholds. Population denominators should derive from official census releases or synthetic population models that exclude identifiable attributes. Temporal fields require truncation to day-level resolution to prevent re-identification through exact timestamp matching. Preprocessing pipelines must generate cryptographic checksums for all input exports to maintain chain-of-custody integrity. For foundational guidance on structuring these inputs within broader spatial workflows, refer to Disease Clustering & Spatial Statistical Modeling.
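The day-level truncation and aggregation step can be sketched as follows. It assumes geocoding has already assigned each record to an aggregation unit and that timestamps arrive as ISO 8601 strings; the function name and record layout are hypothetical.

```python
from collections import Counter
from datetime import datetime


def aggregate_to_daily_counts(records):
    """Collapse (area_id, ISO timestamp) pairs into case-file rows.

    Timestamps are truncated to the calendar day before counting, so no
    exact time of day survives into the exported file.
    """
    counts = Counter()
    for area_id, timestamp in records:
        day = datetime.fromisoformat(timestamp).date()  # drop the time of day
        counts[(area_id, day)] += 1
    # One '<location_id> <cases> <date>' row per area-day pair.
    return [
        f"{area} {n} {day:%Y/%m/%d}"
        for (area, day), n in sorted(counts.items())
    ]
```

Sorting the rows keeps the export byte-for-byte stable across runs, which is what makes a checksum of the file meaningful.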

Parameter File Construction and Retrospective Mode

The .prm configuration file dictates SaTScan's statistical behavior. Retrospective analysis uses AnalysisType=3 (retrospective space-time) or AnalysisType=1 (purely spatial). The likelihood model must align precisely with the epidemiological data structure:

  • ModelType=0: Discrete Poisson, for case counts with a population denominator
  • ModelType=1: Bernoulli, for case-control data supplied through a control file instead of a population file
  • ModelType=2: Space-time permutation, when no population file is available but temporal information is present
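The pairing of models and input files listed above can be checked before a run. The sketch below is keyed on model names rather than numeric codes; REQUIRED_FILES and missing_inputs are hypothetical names, and the mapping covers only the three models discussed here.

```python
# Input files each model needs, per the list above (illustrative mapping).
REQUIRED_FILES = {
    "poisson": {"case", "population", "coordinates"},
    "bernoulli": {"case", "control", "coordinates"},
    "space_time_permutation": {"case", "coordinates"},
}


def missing_inputs(model, provided):
    """Return the sorted input kinds that `model` needs but `provided` lacks."""
    try:
        required = REQUIRED_FILES[model]
    except KeyError:
        raise ValueError(f"Unknown model: {model!r}") from None
    return sorted(required - set(provided))
```

Calling missing_inputs("bernoulli", {"case", "coordinates"}) returns ["control"], which a pipeline can turn into a hard failure before the engine is ever invoked.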

Misalignment between ModelType and input schema produces invalid likelihood ratios and silent statistical failures. Set the number of Monte Carlo replications (MonteCarloReps) to 999 or 9999 for publication-grade p-values. Temporal window constraints (MaxTemporalSize, MaxTemporalSizeInterpretation) must reflect disease incubation periods and reporting lags; MaxTemporalSizeInterpretation selects whether MaxTemporalSize is read as a percentage of the study period (0) or as a length in time units (1). Spatial constraints (MaxSpatialSizeInPopulationAtRisk, a percentage of the population at risk capped at 50, or MaxSpatialSizeInDistanceFromCenter) prevent overfitting and computational exhaustion. Official parameter definitions and version-specific syntax are documented in the SaTScan User Guide.
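The choice between 999 and 9999 replications sets the resolution of the reported p-value. The sketch below uses the standard Monte Carlo rank formula, in which the observed statistic is ranked among itself and the simulated replicas; the function names are illustrative, and counting ties against the observed value is an assumption of this sketch.

```python
def monte_carlo_p_value(observed_llr, simulated_llrs):
    """Rank-based p-value: rank of the observed statistic over (reps + 1)."""
    # Rank 1 means no simulated replica matched or exceeded the observed value.
    rank = 1 + sum(1 for s in simulated_llrs if s >= observed_llr)
    return rank / (len(simulated_llrs) + 1)


def smallest_p_value(reps):
    """Smallest p-value attainable with `reps` Monte Carlo replications."""
    return 1 / (reps + 1)
```

With 999 replications the smallest attainable p-value is 1/1000, and with 9999 it is 1/10000, so the replication count should be chosen with the significance threshold in mind.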

Automated Validation and Pipeline Execution

Production deployments require programmatic validation before engine invocation. The following Python pipeline verifies schema requirements, generates a compliant .prm file, and executes SaTScan via subprocess. The pipeline invokes SaTScan's batch executable with a single argument: the path to the parameter file.

import hashlib
import subprocess
import pandas as pd
from pathlib import Path

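def generate_satscan_prm(case_path: str, pop_path: str, coords_path: str,
                          output_dir: str, model_type: int = 0,
                          num_sims: int = 999, max_spatial_pct: float = 50.0,
                          max_temporal_days: int = 30) -> str:
    """Generate an audit-ready .prm file for retrospective SaTScan execution.

    model_type follows SaTScan's codes (0 = discrete Poisson, the model that
    uses a population file). max_spatial_pct is a percentage of the
    population at risk, at most 50.
    """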

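    # Validate input schemas. The files carry no header row, so check the
    # column count and value types rather than column names.
    case_df = pd.read_csv(case_path, sep=r'\s+', header=None, dtype=str)
    pop_df = pd.read_csv(pop_path, sep=r'\s+', header=None, dtype=str)

    if case_df.shape[1] < 3:
        raise ValueError("Case file needs columns: location_id, cases, date.")
    if pop_df.shape[1] < 3:
        raise ValueError("Population file needs columns: location_id, year, population.")

    # Keep the three leading columns; any further columns are covariates.
    case_df = case_df.iloc[:, :3].copy()
    case_df.columns = ['location_id', 'cases', 'date']
    pop_df = pop_df.iloc[:, :3].copy()
    pop_df.columns = ['location_id', 'year', 'population']

    # pd.to_numeric raises ValueError on non-numeric counts or denominators.
    case_df['cases'] = pd.to_numeric(case_df['cases'])
    pop_df['population'] = pd.to_numeric(pop_df['population'])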

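    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results_stem = str(out_dir / "sat_results")

    # Assumption: the study period runs from the first to the last case date.
    # Widen it explicitly if surveillance covered a longer interval.
    case_dates = pd.to_datetime(case_df['date'])
    start_date = case_dates.min().strftime('%Y/%m/%d')
    end_date = case_dates.max().strftime('%Y/%m/%d')

    # Section headers and keys should be verified against the parameter
    # template saved by your SaTScan version (the seed key in particular).
    # Time precision and aggregation are set to days (code 3) so that
    # MaxTemporalSize with interpretation 1 (time) is measured in days.
    prm_content = f"""[Input]
CaseFile={case_path}
PrecisionCaseTimes=3
StartDate={start_date}
EndDate={end_date}
PopulationFile={pop_path}
CoordinatesFile={coords_path}
CoordinatesType=1

[Analysis]
AnalysisType=3
ModelType={model_type}
ScanAreas=1
TimeAggregationUnits=3
TimeAggregationLength=1

[Output]
ResultsFile={results_stem}

[Spatial Window]
MaxSpatialSizeInPopulationAtRisk={max_spatial_pct}

[Temporal Window]
MaxTemporalSizeInterpretation=1
MaxTemporalSize={max_temporal_days}

[Inference]
MonteCarloReps={num_sims}
RandomSeed=42
"""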

    prm_path = Path(output_dir) / "retrospective_scan.prm"
    prm_path.write_text(prm_content.strip())

    # Log checksums of every input and of the generated .prm for the audit trail
    for f in [Path(case_path), Path(pop_path), Path(coords_path), prm_path]:
        sha256 = hashlib.sha256(f.read_bytes()).hexdigest()
        print(f"[AUDIT] {f.name}: SHA256={sha256}")

    return str(prm_path)


def run_sat_scan(prm_path: str, executable: str = "satscan"):
    """Execute SaTScan with subprocess isolation and error trapping.

    Usage: <batch executable> <parameter_file>
    The batch executable's name and location vary by platform and SaTScan
    version; pass its full path if the default name is not on PATH.
    """
    cmd = [executable, prm_path]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)

    if result.returncode != 0:
        raise RuntimeError(f"SaTScan execution failed:\n{result.stderr}")
    print("[EXECUTION] SaTScan completed successfully. Outputs written to configured directory.")

For robust subprocess management in production environments, consult the official Python subprocess documentation.
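One hardening step worth considering is a wall-clock limit, so that a stalled run cannot block the pipeline indefinitely. The sketch below wraps subprocess.run with a timeout and converts the two most common launch failures into RuntimeError; run_with_timeout is a hypothetical helper and the timeout value is left to the caller.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run `cmd` (a list of strings) and return (returncode, stdout, stderr).

    Raises RuntimeError if the executable is missing or the run exceeds
    `timeout_s` seconds; subprocess.run kills the child on timeout.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=False
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"Command exceeded {timeout_s}s: {cmd}") from exc
    except FileNotFoundError as exc:
        raise RuntimeError(f"Executable not found: {cmd[0]}") from exc
    return result.returncode, result.stdout, result.stderr
```

A caller would pass the same [executable, prm_path] list used by run_sat_scan and persist the captured stdout and stderr alongside the run's other audit records.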

Spatial Validation and Edge-Case Handling

Boundary artifacts occur when case coordinates fall outside the defined population grid or when spatial windows intersect jurisdictional boundaries. Implement a spatial join validation step prior to execution to flag orphaned cases. Zero-population zones in rural, maritime, or restricted areas cause division-by-zero errors in expected case calculations. These zones must be masked or assigned a minimum denominator (e.g., 1.0) with documented justification in the pipeline metadata.
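A lightweight version of this check can run on location IDs alone, before any geometry is involved. The sketch below flags case locations that have no coordinate record and case locations whose population is missing or zero; it is an ID-level check under assumed names (find_input_gaps and its arguments), not a geometric spatial join.

```python
def find_input_gaps(case_ids, coord_ids, population_by_id):
    """Flag case locations lacking coordinates or a positive population.

    `population_by_id` maps location ID to its population denominator.
    Returns a dict of sorted ID lists so the result can be logged as-is.
    """
    case_ids = set(case_ids)
    orphaned = sorted(case_ids - set(coord_ids))
    zero_pop = sorted(
        loc for loc in case_ids if population_by_id.get(loc, 0) <= 0
    )
    return {
        "missing_coordinates": orphaned,
        "zero_population_with_cases": zero_pop,
    }
```

Any non-empty list in the result is a reason to stop and either mask the location or document the minimum-denominator decision before running the scan.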

Failed runs and runs that return no clusters often stem from overly restrictive temporal windows, mismatched coordinate precision, or insufficient population coverage. Where the analysis protocol allows it, implement retry logic with progressive window relaxation: if a run returns zero significant clusters, raise MaxSpatialSizeInPopulationAtRisk by 5 percentage points (up to the 50% ceiling) and re-run, logging all parameter adjustments. Report such re-runs as sensitivity analyses, because repeating a scan until a cluster appears inflates the false-positive rate. For advanced tuning strategies and threshold calibration, review Spatial Scan Statistics Configuration.
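The relaxation loop can be sketched as a schedule plus a logged driver. Window sizes here are expressed in percent of the population at risk with a 50% ceiling, which is an assumption about units; relaxation_schedule, run_sensitivity, and the run_scan callback are hypothetical names, and run_scan stands in for whatever function regenerates the .prm file, runs SaTScan, and counts significant clusters.

```python
def relaxation_schedule(start_pct, step_pct, cap_pct=50.0):
    """List window sizes from start_pct up to cap_pct in step_pct increments."""
    if step_pct <= 0:
        raise ValueError("step_pct must be positive")
    values = []
    current = start_pct
    while current <= cap_pct + 1e-9:  # tolerance for float accumulation
        values.append(round(current, 6))
        current += step_pct
    return values


def run_sensitivity(schedule, run_scan):
    """Call run_scan(pct) for each size until clusters appear; log every run."""
    log = []
    for pct in schedule:
        n_significant = run_scan(pct)
        log.append({"max_spatial_pct": pct, "significant_clusters": n_significant})
        if n_significant > 0:
            break
    return log
```

Because every attempt is appended to the log, including the ones that found nothing, the full sequence of parameter adjustments can be archived with the final result.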

Audit-Ready Execution and Compliance Logging

Production pipelines require deterministic execution tracking. Log the exact .prm configuration, input file hashes, SaTScan binary version, and system environment variables. Store outputs in version-controlled directories with immutable timestamps. Ensure all stochastic steps are reproducible via fixed random seeds (RandomSeed=42 in the generated .prm file; confirm the seed key against the parameter template for your SaTScan version).

Agencies should implement a post-execution validation routine that:

  1. Verifies that output files exist and records the number of reported clusters (zero is a valid result, not a failure)
  2. Cross-references reported p-values against the configured MonteCarloReps
  3. Archives the .prm file alongside the raw and cleaned input files
  4. Generates a compliance manifest containing data lineage, transformation steps, and cryptographic hashes
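Two of these steps lend themselves to a short sketch: the p-value cross-check and the manifest. The code below assumes standard Monte Carlo p-values, for which every reported value should be a whole multiple of 1/(MonteCarloReps + 1); the check does not apply if the run reports approximated p-values. The function names and manifest fields are illustrative, not a prescribed format.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def p_value_matches_reps(p_value, reps, tol=1e-6):
    """True if p_value is a whole multiple (>= 1) of 1 / (reps + 1)."""
    scaled = p_value * (reps + 1)
    return scaled >= 1 - tol and abs(scaled - round(scaled)) <= tol


def build_manifest(files, params, steps):
    """Assemble a JSON-serializable manifest with one SHA-256 hash per file."""
    entries = []
    for path in map(Path, files):
        data = path.read_bytes()
        entries.append({
            "name": path.name,
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "parameters": dict(params),       # e.g. the values written to the .prm
        "transformations": list(steps),   # ordered data lineage notes
        "files": entries,
    }
```

Serializing the returned dictionary with json.dumps(manifest, indent=2, sort_keys=True) gives a stable text artifact that can be archived next to the .prm file.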

Retrospective cluster detection is only as reliable as its configuration and validation layers. By enforcing CRS alignment, automating parameter generation, and embedding compliance logging directly into the execution pipeline, public health teams can deploy SaTScan at scale with audit-ready reproducibility and statistical integrity.