Detecting space-time anomalous regions to improve real estate portfolio management
A quick start version of this guide is available here.
From disease surveillance to the detection of spikes in network usage and environmental monitoring, many applications require monitoring time series data in order to detect anomalous data points. In these event detection scenarios, the goal is either to uncover anomalous patterns in historical space-time data or to swiftly and accurately detect emerging patterns, thereby enabling a timely and effective response to the detected events.
As a concrete example, in this guide we will focus on the task of detecting spikes in violent crimes in the city of Chicago in order to improve portfolio management of real estate insurers.
This guide shows how to use CARTO space-time anomaly detection functionality in the Analytics Toolbox for BigQuery. Specifically, we will cover:
A brief introduction to the method and to the different ways of formulating what constitutes an anomalous, unexpected, or otherwise interesting region
How to identify anomalous space-time regions using the DETECT_SPACETIME_ANOMALIES function
By the end of this guide, you will have detected anomalous space-time regions in time series data of violent crimes in the city of Chicago using different formulations of the anomaly detection problem.
A variety of methods have been developed to monitor time series data and to detect any observations outside a critical range. These include outlier detection methods and approaches that compare each observed data point to its baseline value, which might represent the underlying population at risk or an estimate of the expected value. The latter can be derived from a moving window average or from a counterfactual forecast obtained by time series analysis of the historical data, for example by fitting an ARIMA model using the ARIMA_PLUS or ARIMA_PLUS_XREG model classes in Google BigQuery.
To detect anomalies that affect multiple time series simultaneously, we can either combine the outputs of multiple univariate time series or treat the multiple time series as a single multivariate quantity to be monitored. However, for time series that are also localized in space, we expect that if a given location is affected by an anomalous event, then nearby locations are more likely to be affected than locations that are spatially distant.
A typical approach to the monitoring of spatial time series data uses fixed partitions, which requires defining an a priori spatial neighborhood and temporal window to search for anomalous data. However, in general, we do not have a priori knowledge of how many locations will be affected by an event, and we wish to maintain high detection power whether the event affects a single location (and time), all locations (and times), or anything in between. A coarse partitioning of the search space will lose power to detect events that affect a small number of locations (and times), since the anomalous time series will be aggregated with other non-anomalous data. A fine partitioning of the search space will lose power to detect events that affect many locations (and times), since only a small number of anomalous time series are considered in each partition. Partitions of intermediate size will lose some power to detect both very small and very large events.
A solution to this problem is a multi-resolution approach in which we search over a large and overlapping set of space-time regions, each containing some subset of the data, and find the most significant clusters of anomalous data. This approach, which is known as the generalized space-time scan statistics framework, consists of the following steps:
Choose a set of space-time regions to search over, where each space-time region $S$ consists of a set of space-time locations (e.g. defined using spatial indexes).
Choose a baseline.
Choose models of the data under $H_0$ (the null hypothesis of no cluster of anomalies) and $H_1(S)$ (the alternative hypothesis assuming an anomalous cluster in region $S$). Here we assume that each location's value $x_i$ is drawn independently from some distribution $f(x_i \mid b_i, q)$, where $b_i$ represents the baseline value of that location and $q$ represents some underlying relative risk parameter. We further assume that the relative risk is uniform under the null hypothesis: any space-time variation in the values under the null is accounted for by the baseline parameters, and the method is designed to detect any additional variation not reflected in these baselines.
Derive a score function $F(S)$ based on the likelihood ratio statistic $F(S) = P(\mathrm{Data} \mid H_1(S)) \,/\, P(\mathrm{Data} \mid H_0)$.
Find the most interesting regions, i.e. those regions $S$ with the highest values of $F(S)$.
Calculate the statistical significance of each discovered region using Monte Carlo randomization: generate random permutations of the data, where each replica is a copy of the original search area in which each value is randomly drawn from the null distribution; for each permutation, select the space-time zone associated with the maximum score, and fit a Gumbel distribution to the maximum scores to derive an empirical p-value.
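For illustration, in the expectation-based Poisson case (the setting used later in this guide) the score function has a well-known closed form from the scan statistics literature; the exact score implemented by the procedure may differ in its details.

```latex
% Expectation-based Poisson scan statistic: the log-likelihood ratio of
% region S, maximized over relative risks q >= 1, where C is the sum of
% the observed counts c_i and B the sum of the baselines b_i over S.
F(S) =
\begin{cases}
  C \log\!\left(\frac{C}{B}\right) + B - C, & \text{if } C > B \\
  0, & \text{otherwise}
\end{cases}
\qquad C = \sum_{i \in S} c_i, \quad B = \sum_{i \in S} b_i
```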
While anomaly detection typically focuses on single data points and asks whether each point is anomalous, space-time anomaly detection focuses on finding space-time groups or patterns which are anomalous, even if each individual point in the group might not be surprising on its own.
Overall, clustering and space-time anomaly detection have very different goals (partitioning data into groups versus finding statistically anomalous regions). Nevertheless, some clustering methods, commonly referred to as density-based clustering (e.g. DBSCAN), partition the data based on the density of points, and as a result we might think that these partitions could correspond to the anomalous regions that we are interested in detecting. However, density-based clustering is not adequate for the space-time anomaly detection task: first, we also want to draw statistical conclusions about the regions we find (whether each region represents a significant cluster or is likely to have occurred by chance); and second, we want to be able to deal adequately with spatially (and temporally) varying baselines, while density-based clustering methods are specific to the notion of density as the number of points per unit area.
Methods like the Getis-Ord Gi* statistic, on which hotspot analysis is based, can also be used to identify regions with high or low event intensity. This statistic works by proportionally comparing the local sum of an attribute to the global sum, resulting in a z-score for each observation: observations whose regional sum is significantly higher or lower than the global sum are considered to show statistically significant regional similarity above or below the global trend. However, unlike space-time anomaly detection, this approach uses a fixed spatial and/or temporal window, and it is exploratory rather than suitable for inferential analysis.
Crime data is an often overlooked component in property risk assessments and is rarely integrated into underwriting guidelines, despite the FBI's latest estimates indicating over $16 billion in annual losses from property crimes alone. In this example, we will use the locations of violent crimes in Chicago available in the BigQuery public marketplace, extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. The data are available daily from 2001 to present, minus the most recent seven days, which also allows us to showcase how to use this method to detect space-time anomalies in near real time.
For the purpose of this guide, the data were first aggregated weekly (by assigning each daily record to the previous Monday) and by H3 cell at resolution 7, as shown in this map, where we can visualise the total counts for the whole period by H3 cell, together with the time series of the H3 cells with the most counts.
Each H3 cell has been further enriched with demographic data from the American Community Survey (ACS) at the census block resolution. Finally, each time series has been gap-filled by assigning a zero value to the crime counts variable for any missing week. The final data can be accessed using this query:
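That query is not reproduced here; the following sketch only illustrates the expected shape of the prepared dataset (table and column names are assumptions), with one row per H3 cell and week:

```sql
-- Illustrative only: table and column names are assumptions about the
-- prepared dataset (one row per H3 cell and week, gap-filled).
SELECT
  h3,          -- H3 cell at resolution 7
  week,        -- Monday of the corresponding week
  counts,      -- weekly violent crime counts (0 where gap-filled)
  total_pop    -- ACS total population, used later as the population at risk
FROM `cartobq.docs.chicago_crime_weekly_h3`
ORDER BY h3, week;
```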
We start by detecting the space-time anomalies in counts of violent crimes with respect to the population at risk, given by the H3 total population enriched with data from the 5-year American Community Survey (ACS) at the census block resolution. In this approach to defining the baseline values, named population-based ('estimation_method':'POPULATION'), we expect the crime counts to be proportional to the baseline values, which typically represent the population corresponding to each space-time location and can be either given (e.g. from census data) or inferred (e.g. from sales data), and can be adjusted for any known covariates (such as age of population, risk factors, seasonality, weather effects, etc.). Specifically, we wish to detect space-time regions where the observed rates are significantly higher inside than outside.
Assuming that the counts are Poisson distributed (the typical assumption for count data, 'distributional_model':'POISSON'), we can obtain the space-time anomalies using the following query:
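The query itself is not shown here; what follows is a minimal sketch of the procedure call. The option names are the ones discussed in this guide, while the argument order (input query, index column, date column, value column, baseline column, output table, options) and the [min, max] form of 'kring_size' and 'time_bw' are assumptions; check the DETECT_SPACETIME_ANOMALIES reference for the exact signature.

```sql
-- Sketch only: argument order and the [min, max] option format are
-- assumptions; see the Analytics Toolbox reference for the exact
-- DETECT_SPACETIME_ANOMALIES signature. Table names are illustrative.
CALL `carto-un`.carto.DETECT_SPACETIME_ANOMALIES(
  '''
  SELECT h3, week, counts, total_pop AS baseline
  FROM `cartobq.docs.chicago_crime_weekly_h3`
  ''',
  'h3',        -- spatial index column
  'week',      -- time column
  'counts',    -- observed values
  'baseline',  -- baseline values (here, the population at risk)
  '<my-project>.<my-dataset>.<output-table>',
  '''{
    "estimation_method": "POPULATION",
    "distributional_model": "POISSON",
    "is_prospective": false,
    "kring_size": [1, 3],
    "time_bw": [2, 6],
    "permutations": 99
  }'''
);
```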
As we can see from the query above, in this case we are looking retrospectively for past anomalous space-time regions ('is_prospective': false, i.e. a temporal zone can end at any timestamp), with a spatial extent given by a k-ring ('kring_size') of between 1 (first-order neighbors) and 3 (third-order neighbors) and a temporal extent ('time_bw') of between 2 and 6 weeks. Finally, the 'permutations' parameter sets the number of permutations used to compute the statistical significance of the detected anomalies. As noted above, empirical results suggest that the null distribution of the scan statistic is well fit by a Gumbel extreme value distribution, which can be used to obtain empirical p-values for the spatial scan statistic with great accuracy in the far tail of the distribution: with a smaller number of replications under the null we can calculate very small p-values (for example, p-values on the order of 0.00001 can be accurately calculated with only 999 random replicates by using the Gumbel approximation, whereas more than 999,999 replicates would be required to achieve the same power and precision from Monte Carlo hypothesis testing). The results of this experiment are shown in this map.
As we can see from this map, the space-time zone with the largest score (whose extent is shown in the right panel) has a higher relative risk than the rest of the data.
Another way of interpreting the baselines is to assume that the observed values should be equal (and not just proportional, as in the population-based approach) to the baseline under the null hypothesis of no anomalous space-time regions. This approach, named expectation-based, requires an estimate of the baseline values, which is inferred from the historical time series, potentially adjusting for any relevant external effects such as day-of-week and seasonality.
Computing the expected counts with a moving average
A simple way of estimating the expected crime counts is to compute a moving average of the weekly counts for each H3 cell. For example, we could average each weekly value over the span between the previous and the next three weeks:
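A minimal sketch of this computation over the illustrative table introduced earlier; the centered window averages each week with the three preceding and three following weeks:

```sql
-- Centered 7-week moving average per H3 cell (3 weeks before and after),
-- used as the expectation-based baseline. Table name is illustrative.
SELECT
  h3,
  week,
  counts,
  AVG(counts) OVER (
    PARTITION BY h3
    ORDER BY week
    ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
  ) AS baseline
FROM `cartobq.docs.chicago_crime_weekly_h3`;
```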
The map below shows the spatial and temporal extent of the ten most anomalous regions (the region with rank 1 being the most anomalous), together with the time series of the sum of the counts and baselines (i.e. the moving average values) for the time span of the selected region.
Computing the expected counts from a time series model
To improve the estimate of the baseline values, we could also infer them using a time series model of the past observations that can allow for seasonal and holiday effects. This can be achieved by fitting any standard time series model, such as an ARIMA model, to the time series of each H3 cell:
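One way to do this is with a BigQuery ML ARIMA_PLUS model, fitting one series per H3 cell by declaring the cell index as the time series id column (project, dataset and model names below are placeholders):

```sql
-- Fit one ARIMA_PLUS model per H3 cell; seasonal and US holiday effects
-- are handled by BigQuery ML.
CREATE OR REPLACE MODEL `<my-project>.<my-dataset>.crime_arima`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'week',
  time_series_data_col = 'counts',
  time_series_id_col = 'h3',
  holiday_region = 'US'
) AS
SELECT h3, week, counts
FROM `cartobq.docs.chicago_crime_weekly_h3`;
```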
The baseline values can then be computed by subtracting the residuals from the observed counts, obtained by calling the ML.EXPLAIN_FORECAST function:
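A sketch of that step: for the historical part of each series, ML.EXPLAIN_FORECAST returns, among other columns, the observed value and the residual, so their difference gives the expected (baseline) counts.

```sql
-- Derive the baseline as observed counts minus the model residuals,
-- keeping only the historical (fitted) part of each series.
SELECT
  h3,
  time_series_timestamp AS week,
  time_series_data AS counts,
  time_series_data - residual AS baseline
FROM ML.EXPLAIN_FORECAST(
  MODEL `<my-project>.<my-dataset>.crime_arima`,
  STRUCT(1 AS horizon, 0.9 AS confidence_level)
)
WHERE time_series_type = 'history';
```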
And using the same procedure call as before, we can get the 10 most anomalous regions for the newly computed baselines.
Whether to use a simple moving average or a time series model to infer the baselines depends on the question we are trying to answer (e.g. whether the expected values should be adjusted for day-of-week, seasonal, and holiday effects), as well as on the type and quality of the data (how long the time series is, how noisy it is, etc.). To further investigate the differences between the moving-average and the ARIMA-based baselines, we can plot the difference between the observed and baseline values for each method, as shown here for the ten H3 cells with the largest number of crimes.
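The plotted quantity can be computed by joining the two baseline tables (names here are placeholders for the tables produced in the previous steps) and differencing against the observed counts:

```sql
-- Difference between observed counts and each baseline, per cell and week.
SELECT
  ma.h3,
  ma.week,
  ma.counts - ma.baseline    AS diff_moving_average,
  ma.counts - arima.baseline AS diff_arima
FROM `<my-project>.<my-dataset>.baseline_moving_average` AS ma
JOIN `<my-project>.<my-dataset>.baseline_arima` AS arima
  USING (h3, week);
```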
Adjusting the expected counts to include external effects
In many cases, we also want to adjust the baseline values for known covariates such as weather effects, mobility trends, age of population, income, etc. For example, here we might include the effects of census variables derived from the ACS 5-year averages, like the median age, the median rent, the Black and Hispanic population ratios, the owner-occupied and vacant housing unit ratios, and the ratio of families with young children. To include these additional effects, we can fit for each H3 cell an ARIMA model with external covariates and obtain the covariate-adjusted predictions:
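A sketch of one such per-cell model using BigQuery ML's ARIMA_PLUS_XREG, which treats the non-target columns of the training query as external regressors; since it fits a single series, the statement below would be run (e.g. scripted) once per H3 cell, and the covariate column names are assumptions:

```sql
-- ARIMA_PLUS_XREG fits a single series, so one model is created per
-- H3 cell; all columns other than the timestamp and target are used
-- as external regressors. Covariate column names are illustrative.
CREATE OR REPLACE MODEL `<my-project>.<my-dataset>.crime_arima_xreg_cell`
OPTIONS (
  model_type = 'ARIMA_PLUS_XREG',
  time_series_timestamp_col = 'week',
  time_series_data_col = 'counts'
) AS
SELECT
  week, counts,
  median_age, median_rent,
  black_pop_ratio, hispanic_pop_ratio,
  owner_occupied_ratio, vacant_ratio,
  families_with_young_children_ratio
FROM `cartobq.docs.chicago_crime_weekly_h3`
WHERE h3 = '<h3-cell-id>';  -- placeholder for the cell being fitted
```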
For convenience, we have already joined the results for each H3 cell into a single table.
Given these covariate-adjusted baselines, we can run the procedure to detect space-time anomalies with the same options as before and get the 10 most anomalous regions for the newly computed baselines.
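In sketch form, the call mirrors the population-based one, swapping in the covariate-adjusted baseline column; note that the 'EXPECTATION' value for 'estimation_method' is an assumption to be checked against the procedure reference.

```sql
-- Same sketch as the population-based call, now with the covariate-adjusted
-- baseline; the 'EXPECTATION' option value is an assumption.
CALL `carto-un`.carto.DETECT_SPACETIME_ANOMALIES(
  '''
  SELECT h3, week, counts, baseline
  FROM `<my-project>.<my-dataset>.baseline_arima_xreg`
  ''',
  'h3', 'week', 'counts', 'baseline',
  '<my-project>.<my-dataset>.<output-table>',
  '''{
    "estimation_method": "EXPECTATION",
    "distributional_model": "POISSON",
    "is_prospective": false,
    "kring_size": [1, 3],
    "time_bw": [2, 6],
    "permutations": 99
  }'''
);
```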
The examples given so far showed how to detect anomalies retrospectively ('is_prospective': false), which means that the whole time series is available and the space-time anomalies can happen at any point in time over all the past data (a temporal zone can end at any timestamp). However, the procedure can also be applied when the interest lies in detecting emerging anomalies ('is_prospective': true), for which the search focuses only on the final part of the time series (a temporal zone can only have the last timestamp as its end point). The prospective case is especially useful with real-time data, where the goal is to detect anomalies as quickly as possible. A retrospective analysis, on the other hand, is more useful to understand past events, improve operational processes, validate models, etc.
Whether to use an expectation-based or a population-based approach depends both on the type and quality of the data and on the types of anomalies we are interested in detecting.
Absolute VS relative baselines. If we only have relative (rather than absolute) information about what we expect to see, a population-based approach should be used.
Detection power. The expectation-based approach should be used when we can accurately estimate the expected values in each space-time location, either based on a sufficient amount of historical data, or based on sufficient data from a null or control condition; in these cases, expectation-based statistics will have higher detection power than population-based statistics.
Local VS global changes. If the observed values throughout the entire search region are much higher (or lower) than expected, the expectation-based approach will find these changes very significant; however, if these changes do not vary spatially and/or temporally, the population-based method will not find any significant anomalous space-time regions. If we assume that such changes result from anomalies affecting large space-time regions (and are therefore relevant to detect), the expectation-based approach should be used. On the other hand, if we assume that these changes result from unmodelled and irrelevant global trends (and should therefore be ignored), then it is more appropriate to use the population-based approach.
When the data does not have a temporal component, a similar approach can be applied to detect spatial anomalies using the DETECT_SPATIAL_ANOMALIES procedure. In this case, we are also interested in detecting regions that are anomalous with respect to some baseline that, as in the space-time case, can be computed with the population- or expectation-based approaches. For the latter, a regression model (e.g. a linear model) is typically required, which is used to estimate the expected values and their variances conditional on some covariates.
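A purely illustrative sketch, assuming DETECT_SPATIAL_ANOMALIES mirrors the space-time procedure minus the temporal arguments; the actual signature should be taken from the Analytics Toolbox reference.

```sql
-- Sketch only: the argument list is assumed to mirror
-- DETECT_SPACETIME_ANOMALIES without the temporal parameters;
-- table names are illustrative.
CALL `carto-un`.carto.DETECT_SPATIAL_ANOMALIES(
  '''
  SELECT h3, counts, total_pop AS baseline
  FROM `<my-project>.<my-dataset>.crime_totals_h3`
  ''',
  'h3', 'counts', 'baseline',
  '<my-project>.<my-dataset>.<output-table>',
  '''{
    "estimation_method": "POPULATION",
    "distributional_model": "POISSON",
    "kring_size": [1, 3],
    "permutations": 99
  }'''
);
```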