Time series clustering: Identifying areas with similar traffic accident patterns

Spatio-temporal analysis is crucial for extracting meaningful insights from data that has both spatial and temporal components. By combining spatial information, such as geographic coordinates, with temporal data, such as timestamps, it reveals dynamic behaviors and dependencies across many domains. It applies to industries and use cases such as car sharing and micromobility planning, urban planning, transportation optimization, and more.

In this example, we will perform spatio-temporal analysis to identify areas with similar traffic accident patterns over time using the location and time of accidents in London in 2021 and 2022, provided by Transport for London. This tutorial builds upon this previous one, where we explained how to use the spacetime Getis-Ord functionality to identify traffic accident hotspots.

Data

The source data covers two years of collisions, aggregated weekly into an H3 grid, with the number of collisions counted per cell and week. It is available at cartobq.docs.spacetime_collisions_weekly_h3 and can be explored in the map below.

Spacetime Getis-Ord

We start by performing a spacetime hotspot analysis to better understand our data. We can use the following call to the Analytics Toolbox to run the procedure:

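CALL `carto-un`.carto.GETIS_ORD_SPACETIME_H3_TABLE(
 'cartobq.docs.spacetime_collisions_weekly_h3',     -- input table
 'cartobq.docs.spacetime_collisions_weekly_h3_gi',  -- output table
 'h3',            -- column with the H3 index
 'week',          -- column with the date of each time step
 'n_collisions',  -- column with the value to analyze
 3,               -- size of the spatial neighborhood, in H3 k-rings
 'WEEK',          -- time unit of the series
 1,               -- temporal bandwidth, in time steps
 'gaussian',      -- kernel used for the spatial weights
 'gaussian'       -- kernel used for the temporal weights
);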

For further detail on the spacetime Getis-Ord, take a look at the documentation and this tutorial.

By performing this analysis, we can check how different parts of the city become “hotter” or “colder” as time progresses.
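To make "hotter" and "colder" more concrete, the sketch below computes the standard Getis-Ord Gi* z-score for a single focal cell in plain Python. It is a simplified, space-only illustration with binary weights. It is not the Analytics Toolbox implementation, which also weights neighbors in time and uses the kernels passed to the procedure. The function name, the toy values, and the weights are assumptions made for this example.

```python
import math

def getis_ord_gi_star(values, weights):
    """Gi* z-score for one focal cell (simplified, space-only sketch).

    values:  observed value of every cell in the study area
    weights: weight of every cell relative to the focal cell
             (the focal cell itself is included in its neighborhood)
    """
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation of all values
    std = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    w_sum = sum(weights)
    w_sq_sum = sum(w * w for w in weights)
    numerator = sum(w * v for w, v in zip(weights, values)) - mean * w_sum
    denominator = std * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator

# Hypothetical collision counts along a row of ten cells
counts = [1, 1, 1, 1, 9, 10, 9, 1, 1, 1]

# Binary weights: the focal cell and its two immediate neighbors
hot_weights = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]   # focal cell at position 5
cold_weights = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # focal cell at position 1

gi_hot = getis_ord_gi_star(counts, hot_weights)
gi_cold = getis_ord_gi_star(counts, cold_weights)
print(f"hot cell Gi*:  {gi_hot:.2f}")
print(f"cold cell Gi*: {gi_cold:.2f}")
```

A large positive score marks a neighborhood whose values are well above the overall mean (a hotspot), while a negative score marks one below it (a coldspot).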

Finding time series clusters

Once we have an initial understanding of the spacetime patterns of our data, we proceed to cluster H3 cells based on their temporal patterns. To do this, we use the TIME_SERIES_CLUSTERING procedure, which takes as input:

  • input: The query or fully qualified name of the table with the data

  • output_table: The fully qualified name of the output table

  • partitioning_column: Time series unique IDs, which in this case are the H3 indexes

  • ts_column: Name of the column with the timestamp of each observation

  • value_column: Name of the column with the value per ID and timestep

  • options: A JSON containing the advanced options for the procedure

One of the advanced options is the time series clustering method. Currently, it features two basic approaches:

  • Value characteristic: clusters the series based on the step-by-step distance between their values. The closer two signals are at each time step, the more similar the series are considered and the more likely they are to be clustered together.

  • Profile characteristic: clusters the series based on their dynamics over the time span provided. Here, the higher the correlation between two series, the more likely they are to be clustered together.
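The difference between the two approaches can be illustrated with a small Python sketch. It uses Euclidean distance as a stand-in for the value characteristic and correlation distance (1 minus the Pearson correlation) as a stand-in for the profile characteristic. These are assumptions chosen for illustration, not necessarily the exact measures used by TIME_SERIES_CLUSTERING, and the toy series are made up.

```python
import math

def euclidean_distance(a, b):
    """Step-by-step distance between the values of two series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 - Pearson correlation: small when two series move together."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / math.sqrt(var_a * var_b)

base = [1.0, 2.0, 3.0, 2.0, 1.0]
same_shape_higher = [x + 10 for x in base]          # same dynamics, much higher level
similar_level_flat = [2.0, 2.0, 2.0, 2.0, 1.5]      # similar level, different dynamics

# "Value"-style comparison: the series at a similar level is closer
value_to_higher = euclidean_distance(base, same_shape_higher)
value_to_flat = euclidean_distance(base, similar_level_flat)

# "Profile"-style comparison: the series with the same shape is closer
profile_to_higher = correlation_distance(base, same_shape_higher)
profile_to_flat = correlation_distance(base, similar_level_flat)

print("value distance:  ", value_to_higher, value_to_flat)
print("profile distance:", profile_to_higher, profile_to_flat)
```

With a value-based distance, `base` is grouped with the series at a similar level. With a profile-based distance, it is grouped with the series that rises and falls at the same time steps, regardless of its level.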

Clustering the series as-is can be tricky, since these methods are sensitive to noise in the series. However, because the spacetime Getis-Ord has already smoothed the signal, we can cluster the cells based on the resulting temperature (the gi value) instead. We only consider cells where at least 60% of the observations are statistically significant, that is, with a p-value below 0.05.

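CALL `carto-un`.carto.TIME_SERIES_CLUSTERING(
 -- input: keep only cells whose 60th-percentile p-value is below 0.05
 '''
   SELECT * FROM `cartobq.docs.spacetime_collisions_weekly_h3_gi`
   QUALIFY PERCENTILE_CONT(p_value, 0.6) OVER (PARTITION BY index) < 0.05
 ''',
 'cartobq.docs.spacetime_collisions_weekly_h3_clusters',  -- output_table
 'index',  -- partitioning_column: H3 cell ID
 'date',   -- ts_column: timestamp of each observation
 'gi',     -- value_column: Gi* statistic from the Getis-Ord output
 JSON '{ "method": "profile", "n_clusters": 4 }'  -- options
);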
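The QUALIFY filter in the call above keeps a cell only when the 60th percentile of its p-values is below 0.05. The Python sketch below reproduces that logic on made-up data, assuming the usual linear-interpolation definition of PERCENTILE_CONT. The helper name, cell IDs, and p-values are hypothetical.

```python
def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation between sorted values."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

# Hypothetical p-values per cell, one per weekly observation
p_values_by_cell = {
    "cell_a": [0.01, 0.02, 0.03, 0.04, 0.20, 0.50],  # mostly significant
    "cell_b": [0.01, 0.30, 0.40, 0.50, 0.60, 0.70],  # mostly not significant
}

# Keep cells whose 60th-percentile p-value is below the 0.05 threshold
kept_cells = [
    cell
    for cell, p_values in p_values_by_cell.items()
    if percentile_cont(p_values, 0.6) < 0.05
]
print(kept_cells)
```

Only `cell_a` passes the filter here, since most of its weekly observations are significant, while `cell_b` is dropped before clustering.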

Although clustering the Getis-Ord output rather than the raw counts may feel like an extra layer of indirection, it provides several advantages:

  • Since it has been temporally smoothed, noise has been reduced in the dynamics of the series;

  • and since it has been geographically smoothed, nearby cells are more likely to be clustered together.

This map shows the different clusters that are returned as a result:

We can immediately see the different dynamics in the widget:

  • Apart from cluster #3, which clearly groups the “colder” areas, the rest start 2021 with very similar accident counts.

  • However, from July 2021 onwards, cluster #2 accumulates clearly more collisions than the other two.

  • Even though #1 and #4 have similar levels, certain points differ, like September 2021 or January 2022.

This information is a useful starting point for further analysis of the possible causes of these behaviors, and we were able to extract these insights at a glance from the map. This method “collapsed” the results of the spacetime Getis-Ord into a space-only result, which makes the data easier to explore and understand.
