Detecting space-time anomalous regions to improve real estate portfolio management (quick start)
A more comprehensive version of this guide is available here.
From disease surveillance, to the detection of spikes in network usage, to environmental monitoring, many applications require monitoring time series data in order to detect anomalous data points. In these event detection scenarios, the goal is either to uncover anomalous patterns in historical space-time data or to swiftly and accurately detect emerging patterns, thereby enabling a timely and effective response to the detected events.
As a concrete example, in this guide we will focus on the task of detecting spikes in violent crimes in the city of Chicago in order to improve portfolio management of real estate insurers.
This guide shows how to use CARTO space-time anomaly detection functionality in the Analytics Toolbox for BigQuery. Specifically, we will cover:
A brief introduction to the method and to the different ways of defining anomalous, unexpected, or otherwise interesting regions
How to identify anomalous space-time regions using the DETECT_SPACETIME_ANOMALIES function
By the end of this guide, you will have detected anomalous space-time regions in time series data of violent crimes in the city of Chicago.
Crime data is often an overlooked component in property risk assessments and is rarely integrated into underwriting guidelines, despite the FBI's latest estimates indicating over $16 billion in annual losses from property crimes alone. In this example, we will use the locations of violent crimes in Chicago available in the BigQuery public marketplace, extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. The data are available daily from 2001 to the present, minus the most recent seven days, which also allows us to showcase how to use this method to detect space-time anomalies in near real time.
For the purpose of this guide, the data were first aggregated weekly (by assigning each daily record to the previous Monday) and by H3 cell at resolution 7, as shown in this map, where we can visualize the total counts for the whole period by H3 cell and the time series of the H3 cells with the highest counts.
Each H3 cell has been further enriched using demographic data from the American Community Survey (ACS) at the census block resolution. Finally, each time series has been gap-filled by assigning a zero value to the crime counts variable wherever a week has no records. The final data can be accessed using this query.
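The weekly aggregation and gap filling described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the query used to prepare the data: the function names and the cell identifiers are hypothetical, and it assumes that a record falling on a Monday is assigned to that same Monday.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(d: date) -> date:
    """Return the Monday on or before d (assumption: a Monday maps to itself)."""
    return d - timedelta(days=d.weekday())


def weekly_counts(events, first_week, last_week):
    """Aggregate (cell, date) events into gap-filled weekly counts per cell.

    first_week and last_week are Mondays delimiting the output range.
    Weeks without any event get an explicit count of zero.
    """
    counts = Counter((cell, week_start(d)) for cell, d in events)
    cells = sorted({cell for cell, _ in events})

    # Build the complete list of weeks so that every series has the same length.
    weeks = []
    w = first_week
    while w <= last_week:
        weeks.append(w)
        w += timedelta(weeks=1)

    return {cell: [(w, counts.get((cell, w), 0)) for w in weeks] for cell in cells}
```

Calling weekly_counts with a list of (cell, date) pairs returns, for every cell, one (week, count) pair per week in the requested range, so all series are aligned and free of gaps.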
To detect anomalies that affect multiple time series simultaneously, we can either combine the outputs of multiple univariate time series or treat the multiple time series as a single multivariate quantity to be monitored. However, for time series that are also localised in space, we expect that if a given location is affected by an anomalous event, then nearby locations are more likely to be affected than locations that are spatially distant.
A typical approach to the monitoring of spatial time series data uses fixed partitions, which requires defining an a priori spatial neighbourhood and temporal window to search for anomalous data. However, in general, we do not have a priori knowledge of how many locations will be affected by an event, and we wish to maintain high detection power whether the event affects a single location (and time), all locations (and times), or anything in between.
A solution to this problem is a multi-resolution approach in which we search over a large and overlapping set of space-time regions, each containing some subset of the data, and find the most significant clusters of anomalous data. This approach, known as the generalized space-time scan statistics framework, consists of computing a score function that compares the probability that a given space-time region is anomalous, relative to some baseline, with the probability that there are no anomalous regions. The region(s) with the highest score that are also statistically significant at the chosen significance level are identified as the (most) anomalous.
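In symbols, the score is usually written as a likelihood ratio. The notation below is an assumption introduced here for illustration (it does not appear in this guide): S is a candidate space-time region, H_1(S) is the hypothesis that S is anomalous, and H_0 is the null hypothesis of no anomalous regions.

```latex
F(S) = \frac{\Pr(\text{Data} \mid H_1(S))}{\Pr(\text{Data} \mid H_0)},
\qquad
S^{*} = \arg\max_{S} F(S)
```

The region S^{*} is then reported as anomalous only if its score is statistically significant.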
Depending on the type of anomalies that we are interested in detecting, different baselines can be chosen:
Population-based baselines ('estimation_method':'POPULATION'). In this case we only have relative (rather than absolute) information about what we expect to see and we expect the observed value to be proportional to the baseline values. These typically represent the population corresponding to each space-time location and can be either given (e.g. from census data) or inferred (e.g. from sales data), and can be adjusted for any known covariates (such as age of population, risk factors, seasonality, weather effects, etc.).
Expectation-based baselines ('estimation_method':'EXPECTATION'). Another way of interpreting the baselines is to assume that, under the null hypothesis of no anomalous space-time regions, the observed values should be equal (and not just proportional, as in the population-based approach) to the baseline. This approach requires an estimate of the baseline values, which are inferred from the historical time series, potentially adjusting for any relevant external effects such as day-of-week and seasonality. Such an estimate can be derived from a moving window average or from a counterfactual forecast obtained from time series analysis of the historical data, for example by fitting an ARIMA model using the ARIMA_PLUS or the ARIMA_PLUS_XREG model classes in Google BigQuery.
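To make the difference between the two baseline types concrete, here is a minimal sketch of the Poisson log-likelihood-ratio scores commonly used in the scan statistics literature for each case. It is an assumption that these match the formulas used internally by the Analytics Toolbox; the function names are hypothetical. In both cases, count and baseline are the sums of the observed and baseline values over the candidate region.

```python
import math


def _x_log_x_over_y(x, y):
    """Return x * log(x / y), with the convention that the term is 0 when x == 0."""
    return x * math.log(x / y) if x > 0 else 0.0


def expectation_poisson_score(count, baseline):
    """Expectation-based score: counts are expected to EQUAL the baseline.

    Null: count ~ Poisson(baseline). Alternative: count ~ Poisson(q * baseline), q > 1.
    Only regions with more counts than expected get a positive score.
    """
    if baseline <= 0 or count <= baseline:
        return 0.0
    return count * math.log(count / baseline) + baseline - count


def population_poisson_score(count_in, base_in, count_all, base_all):
    """Population-based score: counts are expected to be PROPORTIONAL to the baseline.

    Compares the rate inside the region with the rate outside it; the region
    scores above zero only when its rate is higher than the rate elsewhere.
    """
    count_out = count_all - count_in
    base_out = base_all - base_in
    if base_in <= 0 or base_out <= 0:
        return 0.0
    if count_in / base_in <= count_out / base_out:
        return 0.0
    return (
        _x_log_x_over_y(count_in, base_in)
        + _x_log_x_over_y(count_out, base_out)
        - _x_log_x_over_y(count_all, base_all)
    )
```

Note that the population-based score needs the totals over the whole study area, because the expected rate is itself estimated from the data, whereas the expectation-based score depends only on the region being evaluated.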
A simple way of estimating the expected crime counts is to compute a moving average of the weekly counts for each H3 cell. For example, we could average each weekly value over a window spanning the previous three and the next three weeks.
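As a rough illustration of this baseline, the following Python sketch computes a centered moving average for a single cell's weekly series. Two choices here are assumptions rather than details from the guide: the window includes the current week, and it is simply truncated at the start and end of the series.

```python
def centered_moving_average(values, half_window=3):
    """Average each value over the weeks from t - half_window to t + half_window.

    The current week is included, and the window is truncated at the edges
    of the series (both are assumptions made for this sketch).
    """
    n = len(values)
    averages = []
    for t in range(n):
        lo = max(0, t - half_window)
        hi = min(n, t + half_window + 1)
        window = values[lo:hi]
        averages.append(sum(window) / len(window))
    return averages
```

Because the window also looks forward in time, a baseline of this kind is suited to retrospective analyses; for detecting emerging anomalies one would instead rely only on past weeks or on a forecast.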
Assuming that the counts are Poisson distributed (which is the typical assumption for count data, 'distributional_model':'POISSON'), we can obtain the space-time anomalies using the following query
As we can see from the query above, in this case we are looking retrospectively for past anomalous space-time regions ('is_prospective: false'), i.e. the space-time anomalies can happen at any point in time over all the past data, as opposed to emerging anomalies, for which the search focuses only on the final part of the time series. The spatial extent is a k-ring ('kring_size') between 1 (first-order neighbours) and 3 (third-order neighbours), and the temporal extent ('time_bw') is between 2 and 16 weeks. Finally, the 'permutations' parameter defines the number of permutations used to compute the statistical significance of the detected anomalies.
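To give an intuition for how these parameters interact, here is a toy retrospective scan in plain Python. It is a sketch under strong assumptions and not the implementation behind DETECT_SPACETIME_ANOMALIES: cells are placed on a line (so the "k-ring" of cell i is simply cells i-k to i+k, standing in for H3 k-rings), the score is the expectation-based Poisson log-likelihood ratio, and significance is estimated with Monte Carlo replicas drawn from the baselines under the null hypothesis. All function names are hypothetical; the argument names mirror the parameters discussed above.

```python
import math
import random


def score(count, baseline):
    """Expectation-based Poisson log-likelihood ratio (assumed score function)."""
    if baseline <= 0 or count <= baseline:
        return 0.0
    return count * math.log(count / baseline) + baseline - count


def best_region(counts, baselines, kring_size, time_bw):
    """Scan every (centre, k, start, length) region and return (score, region).

    counts and baselines are lists indexed as [cell][week]; kring_size and
    time_bw are inclusive (min, max) pairs. Starts range over the whole
    series, which corresponds to a retrospective search.
    """
    n_cells, n_weeks = len(counts), len(counts[0])
    best = (0.0, None)
    for centre in range(n_cells):
        for k in range(kring_size[0], kring_size[1] + 1):
            cells = range(max(0, centre - k), min(n_cells, centre + k + 1))
            for length in range(time_bw[0], time_bw[1] + 1):
                for start in range(n_weeks - length + 1):
                    weeks = range(start, start + length)
                    c = sum(counts[i][t] for i in cells for t in weeks)
                    b = sum(baselines[i][t] for i in cells for t in weeks)
                    s = score(c, b)
                    if s > best[0]:
                        best = (s, (centre, k, start, length))
    return best


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) with Knuth's method (fine for small means)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def scan_with_p_value(counts, baselines, kring_size, time_bw, permutations, seed=0):
    """Return (region, observed_score, p_value) for the top-scoring region.

    The p-value is the share of null replicas whose best score reaches the
    observed one, with the usual +1 correction.
    """
    rng = random.Random(seed)
    observed, region = best_region(counts, baselines, kring_size, time_bw)
    exceed = 0
    for _ in range(permutations):
        replica = [[poisson_draw(b, rng) for b in row] for row in baselines]
        if best_region(replica, baselines, kring_size, time_bw)[0] >= observed:
            exceed += 1
    return region, observed, (exceed + 1) / (permutations + 1)
```

The sketch also shows why the cost grows quickly: the full search is repeated once per replica, so wider 'kring_size' and 'time_bw' ranges and a larger 'permutations' value all multiply the number of regions that must be scored.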
The map below shows the spatial and temporal extent of the ten most anomalous regions (the region with rank 1 being the most anomalous), together with the time series of the sum of the counts and baselines (i.e. the moving average values) for the time span of the selected region.
To explore the effect of choosing different baselines and parameters, check the extended version of this guide, where the method is described in more detail and we offer step-by-step instructions to implement various configurations of the procedure.