Model documentation

XGBoost technology and prototype forecasting model

This page first explains what XGBoost is as a forecasting technology, then documents the practical parameters and logic used in this Tartu public-safety prototype.

Critical methodology explanation

This model is also human work

The model is not a neutral technical object. It is built from human decisions about what to predict, which algorithm to trust, which features to include, how to categorize places, how to weight errors, and how to translate predictions into operational risk.

1. Choosing XGBoost is already a design choice. Someone decided that a tree-based supervised model was the right framing, instead of a simpler baseline, a spatial model, a causal model, a simulation model, or no predictive model at all.

2. Defining the target and features is subjective work. Decisions such as predicting incident_count, using area-hour history, encoding weather, or including nightlife, student, vulnerability, and density markers shape what the model can learn.

3. Tuning, validation, thresholds, smoothing, and risk-band design are also human choices. Different people could reasonably choose different horizons, error metrics, feature sets, weights, and categories, producing different outputs from the same city.

All model outcomes inherit this human design work. The forecast is an accumulation of choices, assumptions, local know-how, constraints, and omissions. It can be done in very different ways, so its outputs should be read as situated estimates, not neutral facts.

Part 1

What XGBoost is

XGBoost stands for Extreme Gradient Boosting. It is a supervised machine-learning method that builds many decision trees in sequence, where each new tree tries to correct the errors made by the previous ones.

For forecasting work, XGBoost is often used when the input data contains mixed feature types, nonlinear relationships, threshold effects, and interactions between variables such as time, weather, land use, and neighborhood characteristics.

  • It handles structured tabular data very well.
  • It can capture interactions such as “Friday night + nightlife area + rain”.
  • It usually performs strongly without requiring extensive feature scaling or preprocessing.
  • It can be tuned for regression tasks such as hourly incident-count prediction.
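The sequential error-correction idea behind boosting can be sketched without the library itself: each stage fits a simple learner (here a one-split "stump") to the residuals left by the ensemble so far. This is a minimal standard-library illustration of the principle, not the real XGBoost implementation, which adds regularization, second-order gradients, column subsampling, and many optimizations; the toy data is invented.

```python
# Minimal gradient-boosting sketch: each new "tree" (a one-split stump)
# fits the residual errors left by the ensemble built so far.

def fit_stump(xs, residuals):
    """Find the single threshold split on x that best reduces squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
             + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Boosted ensemble: start from the mean, repeatedly fit the residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)   # each stage corrects prior errors
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Toy data: hourly incident counts that jump late in the evening.
hours = [8, 10, 12, 14, 18, 20, 22, 23]
counts = [1, 1, 2, 2, 3, 6, 8, 9]
model = boost(hours, counts)
```

After a handful of rounds the ensemble separates the quiet daytime hours from the busy late hours, which is exactly the kind of threshold effect tree boosting captures well.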
Part 2

How it would fit this project

In a full production version, the model target would be incident_count for one district at one hour. Candidate features would include:

  • district type, density, population, nightlife, student, and vulnerability markers
  • hour of day, day of week, weekend/night indicators, and seasonality
  • weather context and persistence blocks
  • historical incident patterns from past area-hour combinations
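One area-hour training row built from those candidate features could look like the following sketch. Every field name and value here is a hypothetical illustration, not the prototype's actual schema.

```python
# Hypothetical feature row for one (district, hour) training example.
# All field names and values are illustrative, not the real schema.
feature_row = {
    "district_type": "nightlife",   # district type marker
    "population_density": 3400,     # residents per km^2
    "student_share": 0.22,          # student population marker
    "vulnerability_index": 0.4,     # vulnerability marker
    "hour_of_day": 23,
    "day_of_week": 5,               # Friday
    "is_weekend": 1,
    "is_night": 1,
    "weather_state": "rain",
    "lag_24h_count": 4,             # same hour yesterday
    "area_hour_mean": 3.1,          # historical area-hour average
}
target = {"incident_count": 5}      # the quantity the model would predict
```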

The application layer would then convert predicted incident counts into patrol-planning signals, risk bands, and operational summaries.

Prototype model

Parameters used in this demo

The current map uses a deterministic, readable forecast engine inspired by an ML workflow. It blends historical area-hour behavior, citywide temporal pulse, weather context, and a small motion wave.

  • Forecast horizon: 48 hours. Builds one citywide hourly outlook for the next two days.
  • Area exact history weight: 0.55. Uses the same district + weekday + hour pattern as the main anchor.
  • Area hour fallback weight: 0.18. Falls back to district + hour behavior across all weekdays.
  • Area day fallback weight: 0.12. Adds district + weekday behavior independent of exact hour.
  • Area mean weight: 0.10. Keeps each district tied to its overall historical baseline.
  • City pulse bridge weight: 0.05. Injects citywide temporal pressure into the district baseline.
  • City pulse clamp: 0.70 to 1.34. Prevents citywide timing effects from becoming unrealistically large.
  • Weather factors: clear 1.04, rain 0.92, snow 0.82, storm 0.88, extremes 0.90. Modulates hourly incident expectations by weather state.
  • Weather outlook logic: mostly clear, with rain blocks lasting 3 to 6 hours. Keeps the short forecast visually stable and realistic.
  • Motion wave: 1 +/- 0.05 sinusoidal variation. Adds slight hour-to-hour movement so adjacent hours are not identical.
  • Series smoothing: 0.22 previous, 0.56 current, 0.22 next. Smooths the hourly forecast to avoid harsh spikes between neighboring hours.
  • Risk score formula: 0.68 incident intensity + 0.32 per-capita pressure. Converts forecast counts into a 1 to 10 operational risk score.
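The blend can be written out directly from the documented parameters. The sketch below reproduces the layer weights, the city-pulse clamp, the weather factors, the three-point smoothing, and the risk-score formula; the input statistics are invented placeholders, and the exact way the pulse is injected via the district mean is an assumption.

```python
# Sketch of the deterministic blend using the documented weights.
# All input statistics are invented placeholder values.

WEATHER_FACTOR = {"clear": 1.04, "rain": 0.92, "snow": 0.82,
                  "storm": 0.88, "extremes": 0.90}

def clamp(x, lo=0.70, hi=1.34):
    """City pulse clamp: keep citywide timing effects in a realistic band."""
    return max(lo, min(hi, x))

def blend_hour(stats, city_pulse, weather):
    """Blend the historical layers for one district-hour (weights sum to 1.0).
    How the pulse multiplies the district mean is an assumption of this sketch."""
    base = (0.55 * stats["exact"]    # district + weekday + hour anchor
          + 0.18 * stats["hour"]     # district + hour, any weekday
          + 0.12 * stats["day"]      # district + weekday, any hour
          + 0.10 * stats["mean"]     # district overall baseline
          + 0.05 * stats["mean"] * clamp(city_pulse))  # citywide pulse bridge
    return base * WEATHER_FACTOR[weather]

def smooth(series):
    """Three-point smoothing: 0.22 previous + 0.56 current + 0.22 next."""
    out = []
    for i, cur in enumerate(series):
        prev = series[i - 1] if i > 0 else cur
        nxt = series[i + 1] if i < len(series) - 1 else cur
        out.append(0.22 * prev + 0.56 * cur + 0.22 * nxt)
    return out

def risk_score(incident_intensity, per_capita_pressure):
    """0.68 incident intensity + 0.32 per-capita pressure, scaled to 1-10.
    Both inputs are assumed pre-normalized to the 0-1 range."""
    score = 0.68 * incident_intensity + 0.32 * per_capita_pressure
    return round(1 + 9 * score, 1)
```

Because the four historical weights plus the pulse bridge sum to 1.0, a district whose layers all agree passes through the blend unchanged before the weather factor is applied.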
Operational thresholds

Current patrol strategy bands

  • Risk 8-10: visible patrol presence and fast-response positioning
  • Risk 6-7: targeted patrol pass during the higher-demand window
  • Risk 1-5: routine coverage with watchlist awareness
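The bands above reduce to a small threshold lookup; the returned labels are paraphrased from the list.

```python
def patrol_band(risk):
    """Map a 1-10 operational risk score to the documented patrol band."""
    if risk >= 8:
        return "visible patrol presence and fast-response positioning"
    if risk >= 6:
        return "targeted patrol pass during the higher-demand window"
    return "routine coverage with watchlist awareness"
```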
Future upgrade path

What a full ML version would add

  • real XGBoost model training and validation on rolling historical windows
  • feature importance and SHAP-style interpretability layer
  • separate calibration of prediction intervals and rare-event handling
  • automatic refresh from newly ingested police and weather feeds
Next-generation concept

How the model and technology could be improved

A stronger future version would go beyond the current district-level prototype and evolve into a richer operational forecasting service. The main upgrade path is to add more informative features, more granular spatial modeling, live API integrations, and a scalable product layer around the map.

1. Richer feature engineering

The next model should ingest many more predictive features. Examples include incident category detail, such as whether the event was traffic-related, disturbance-related, or another specific public-order type. That would let the model distinguish between very different demand patterns instead of learning only one generic incident-count signal.

  • incident category and subcategory
  • traffic-specific versus public-order-specific signals
  • event calendars, paydays, school cycles, and nightlife seasonality
  • lagged history, persistence features, and confidence intervals
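The lagged-history and persistence features in the list above can be derived from a plain hourly count series. The window lengths below (1 hour, 24 hours, 168 hours, a rolling 6-hour mean) are illustrative choices, not the model's specification.

```python
def lag_features(series, t):
    """Build simple lag/persistence features for hour index t.
    Window lengths (1h, 24h, 168h, rolling 6h) are illustrative choices."""
    return {
        "lag_1h": series[t - 1],                            # previous hour
        "lag_24h": series[t - 24],                          # same hour yesterday
        "lag_168h": series[t - 168] if t >= 168 else None,  # same hour last week
        "roll_6h_mean": sum(series[t - 6:t]) / 6,           # short persistence
    }

# Example on a synthetic two-week hourly series with a simple daily shape.
series = [(h % 24) // 6 for h in range(14 * 24)]
f = lag_features(series, 200)
```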

2. Finer GIS and spatial resolution

Instead of forecasting only by suburb, the city could be divided into a much finer GIS grid, for example a 100 x 100 meter lattice. That would produce a more realistic map of hot corridors, nightlife clusters, transport routes, and recurring micro-locations.

  • 100 x 100 meter grid instead of suburb-only aggregation
  • street-network, transport-node, and land-use context
  • better distinction between corridor effects and broad district averages
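Mapping incidents onto a 100 x 100 meter lattice only needs a local metric projection. The sketch below uses an equirectangular approximation around Tartu's latitude, which is adequate at city scale; the reference corner is an assumed placeholder.

```python
import math

# Hypothetical reference corner for the grid, near Tartu's center.
REF_LAT, REF_LON = 58.36, 26.70
CELL_M = 100  # 100 x 100 meter cells

def grid_cell(lat, lon):
    """Map a coordinate to a (row, col) cell on a 100 m lattice.
    Equirectangular approximation: fine at city scale, not for large areas."""
    m_per_deg_lat = 111_320
    m_per_deg_lon = 111_320 * math.cos(math.radians(REF_LAT))
    north_m = (lat - REF_LAT) * m_per_deg_lat
    east_m = (lon - REF_LON) * m_per_deg_lon
    return int(north_m // CELL_M), int(east_m // CELL_M)
```

A production version would more likely use a proper projected coordinate system, but the cell-indexing logic stays the same.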

3. Live APIs and continuously refreshed predictions

The application could communicate with weather services through APIs and use live weather feeds when computing each new hourly forecast. In a fuller public-sector ecosystem, the system could also ingest near-real-time incident counts from aggregated public APIs and refresh each area-hour prediction automatically.

  • live weather API integration
  • future aggregated incident feeds from public-sector APIs
  • hourly refresh cycle for every area and every forecast step
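Whichever weather service is chosen, the integration reduces to parsing an hourly payload and mapping conditions onto the factors documented earlier. The payload shape below is hypothetical, standing in for whatever a real API client would return.

```python
import json

# Factors from the prototype table; unknown conditions fall back to neutral.
WEATHER_FACTOR = {"clear": 1.04, "rain": 0.92, "snow": 0.82,
                  "storm": 0.88, "extremes": 0.90}

def hourly_factors(payload_text):
    """Parse a (hypothetical) hourly weather payload into blend factors."""
    payload = json.loads(payload_text)
    return [WEATHER_FACTOR.get(h.get("condition", "clear"), 1.0)
            for h in payload["hours"]]

# Hypothetical payload a live API client might hand over.
sample = ('{"hours": [{"condition": "clear"},'
          ' {"condition": "rain"}, {"condition": "rain"}]}')
factors = hourly_factors(sample)
```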

4. Interactive product layer

The map application itself is already the right direction: an interactive Tartu city map that shows the next 48 hours of estimated public-order load by area, based on historical patterns, weather, weekday structure, and events. The stronger version would keep the heatmap, the hourly slider, the numerical incident forecast, and the risk-score overlay, but drive them from a continuously updated production model.

In that setup, the user could inspect each area hour by hour, view risk scores directly on the map, and read clearer risk interpretations below the legend or in supporting panels.

5. Scalable service concept

The same logic could be scaled from one city to a county-level or state-level service. Tartu would act as the pilot project, and the technical stack could then be extended gradually to new territories without changing the basic forecasting concept.

One possible product concept is CityWatch.ai: a licensed forecasting and map plugin that a local government could embed into its own city website or run on a dedicated public-safety page.

6. Operating model and commercialization

In that scenario, the municipality would pay a service license fee, for example EUR 1000 per month, while also supplying the aggregated incident feed through cooperation with local police. Because the service would work with area-level aggregated counts rather than direct personal identifiers, the data model could remain relatively lightweight while still being operationally useful.

This would turn the prototype from a one-off demonstration into a reusable forecasting service with a clearer ownership model, maintenance path, and deployment story.