Optimizing your data for spatial analysis
It's not uncommon for geospatial datasets to be larger than their non-geospatial counterparts, and geospatial operations can be slow or resource-demanding. That's no surprise: representing things and events on Earth and then computing their relationships is not an easy task.
With CARTO, you will unlock a way to do spatial analytics at scale, combining the huge computational power of your data warehouse with our expertise and tools to work with millions or billions of data points. And we'll try to make it easy for you!
In this guide we'll help you prepare your data so that it is optimized for spatial analysis with CARTO.
Benefits of using optimized data
Having clean, optimized data at the source (your data warehouse) will:
Improve the performance of all analyses, apps, and visualizations made with CARTO
Reduce the computing costs incurred in your data warehouse
General tips and rules
Before diving into the optimizations specific to each data warehouse, here are some general data optimization patterns that apply to all of them:
Optimization rule #1 — Can you reduce the volume of data?
While CARTO tries to automatically optimize the amount of data requested, having a huge source table is always a bigger challenge than having a smaller one.
Sometimes we find ourselves trying to use a huge table called raw_data with 50 TB of data, only to then realize: I actually don't need all the data in this table!
If that's your case and the raw data is static, then it's a good idea to materialize the subset or aggregation you need for your use case in a separate (smaller) table, as sketched below.
If that's your case and the raw data changes constantly, then it might be a good idea to build a data pipeline that refreshes your (smaller) table. You can build it easily using CARTO Workflows.
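For the static case, a minimal SQL sketch could look like the one below. All table and column names (other than raw_data) are hypothetical; adapt the filters and aggregations to your own use case.

```sql
-- Minimal sketch: materialize only the subset of a huge table that you need.
-- Table and column names (other than raw_data) are hypothetical.
CREATE TABLE analysis.stores_spain AS
SELECT
  store_id,
  revenue,
  geom
FROM raw_data
WHERE country = 'Spain'            -- keep only the area of interest
  AND event_date >= '2023-01-01';  -- and only the time range you need
```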
Optimization rule #2 — Are you using the right spatial data type?
If you've read our previous guides, you already know CARTO supports multiple spatial data types.
Each data type has its own particularities when it comes to performance and optimization:
Points: points are great to represent specific locations, but dealing with millions or billions of points is typically a sub-optimal way of solving spatial challenges. Consider aggregating your points into spatial indexes using CARTO Workflows (see the sketch after this list).
Polygons: polygons typically reflect meaningful areas in our analysis, but they quickly become expensive if you use too many, too small, or overly complex polygons. Consider simplifying your polygons or using a higher-level aggregation to reduce their number; both operations can be achieved with CARTO Workflows. Polygons are also prone to becoming invalid geometries, and it is generally a good idea to avoid overlapping geometries.
Lines: lines are an important way of representing linear features such as highways and rivers, and are key to network analyses like route optimization. Like polygons, they can quickly become expensive and should be simplified where possible.
Spatial Indexes: spatial indexes currently offer the best performance and lowest costs for visualization and analysis purposes ✨ If you're less familiar with spatial indexes or need a refresher, we have prepared a specific Introduction to Spatial Indexes.
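To illustrate the point-aggregation tip above, here is a minimal sketch for BigQuery that assumes the CARTO Analytics Toolbox is installed (its H3_FROMGEOGPOINT function converts a point into an H3 cell). All table and column names are hypothetical, and CARTO Workflows can produce the same result without writing any SQL.

```sql
-- Sketch: aggregate raw points into H3 cells at resolution 8 (BigQuery syntax).
-- Assumes the CARTO Analytics Toolbox; adjust the project qualifier to your
-- installation. Table and column names are hypothetical.
CREATE TABLE mydataset.points_h3_agg
CLUSTER BY h3
AS
SELECT
  `carto-un`.carto.H3_FROMGEOGPOINT(geom, 8) AS h3,  -- point -> H3 cell index
  COUNT(*) AS num_points,                            -- aggregated metrics
  SUM(revenue) AS total_revenue
FROM mydataset.raw_points
GROUP BY h3;
```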
Data warehouse specific optimizations
The techniques to optimize your spatial data are slightly different for each data warehouse provider, so we've prepared specific guides for each of them. Check the ones that apply to you to learn more:
CARTO will automatically detect any missing optimization when you try to use data in Data Explorer or Builder. In most cases, we'll help you apply it automatically, in a new table or in that same table.
Check our Data Explorer documentation for more information.
Optimizing your Google BigQuery data
Make sure your data is clustered by your geometry or spatial index column.
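For example, a minimal sketch of creating a clustered copy of an existing table (table and column names are hypothetical):

```sql
-- BigQuery: create a copy of a table clustered by its geometry column.
-- Table and column names are hypothetical.
CREATE TABLE mydataset.stores_clustered
CLUSTER BY geom
AS SELECT * FROM mydataset.stores;

-- For spatial index tables, cluster by the index column (e.g. h3) instead.
```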
Optimizing your Snowflake data
If your data is points/polygons: make sure Search Optimization is enabled on your geometry column
If your data is based on spatial indexes: make sure it is clustered by your spatial index column.
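A minimal sketch of both options (table and column names are hypothetical):

```sql
-- Snowflake, geometry tables: enable Search Optimization on the GEOGRAPHY column.
ALTER TABLE stores ADD SEARCH OPTIMIZATION ON GEO(geom);

-- Snowflake, spatial index tables: cluster the table by the spatial index column.
ALTER TABLE cells CLUSTER BY (h3);
```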
Optimizing your Amazon Redshift data
If your data is points/polygons: make sure the SRID is set to EPSG:4326
If your data is based on spatial indexes: make sure you're using your spatial index column as the sort key.
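A minimal sketch of both options (table and column names are hypothetical; if your geometries have no SRID assigned at all, use ST_SetSRID instead of ST_Transform):

```sql
-- Redshift, geometry tables: store geometries in EPSG:4326.
UPDATE stores SET geom = ST_Transform(geom, 4326) WHERE ST_SRID(geom) <> 4326;

-- Redshift, spatial index tables: use the spatial index column as the sort key.
ALTER TABLE cells ALTER SORTKEY (h3);
```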
Optimizing your Databricks data
Make sure your table is Z-ordered by your H3 column.
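A minimal sketch for a Delta table (table and column names are hypothetical):

```sql
-- Databricks (Delta Lake): Z-order the table by its H3 column.
OPTIMIZE cells ZORDER BY (h3);
```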
Optimizing your PostgreSQL data
Make sure your data is indexed by your geometry or spatial index column.
If your data is points/polygons: make sure the SRID is set to EPSG:3857
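A minimal sketch with PostGIS (table, column, and index names are hypothetical; adjust the geometry type to match your data):

```sql
-- PostgreSQL + PostGIS: GiST index on the geometry column,
-- or a plain B-tree index on a spatial index (e.g. H3) column.
CREATE INDEX stores_geom_idx ON stores USING GIST (geom);
CREATE INDEX cells_h3_idx ON cells (h3);

-- Reproject geometries to EPSG:3857 if they are stored in another SRID.
ALTER TABLE stores
  ALTER COLUMN geom TYPE geometry(Point, 3857)
  USING ST_Transform(geom, 3857);
```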
Optimizing your CARTO Data Warehouse data
Make sure your data is clustered by your geometry or spatial index column.
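The CARTO Data Warehouse runs on Google BigQuery, so the same CLUSTER BY approach applies; for example (dataset, table, and column names are hypothetical):

```sql
-- CARTO Data Warehouse (BigQuery-based): create a clustered copy of your table.
CREATE TABLE organization_data.stores_clustered
CLUSTER BY geom
AS SELECT * FROM organization_data.stores;
```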
How CARTO helps you apply these optimizations
As you've seen throughout this guide, we try our best to automatically optimize the performance and costs of all analyses, apps, and visualizations made using CARTO. We also provide tools like CARTO Workflows and our Data Explorer UI-assisted optimizations to help you succeed.