Calculating market penetration in CPG with merchant universe matching
Merchant universe matching analysis in CPG consists of matching a company's current distributors (also referred to as merchants or customers) to a more extensive set of potential distributors in order to understand the company's market penetration. A universe is a dataset of merchants that can be collected in-house or provided by an external source. In this analysis, we will be dealing with two of these datasets:
The current universe is the set of merchants that the company currently works with. It is usually an internal relation of known sales accounts.
The total universe is a larger set of potential merchants to be considered. Its source is usually a third party, whether a free source on the internet or a premium data provider.
The objective of this analysis is to generate a mapping from our current universe to the total universe and extract insights from it. This is usually no easy task, since the two datasets come from different sources and therefore share no common index to join on.
The Analytics Toolbox provides different functions to make this analysis easier for the user. This example will present a complete analysis using these functions and the steps involved in it.
For this example, we will work with a small beverage distributor based in Berlin, using the following tables:
A current universe table including different venues (restaurants, bars, hotels, etc.) where the product is currently being sold. The names and locations of the venues have been extracted from OpenStreetMap. The table can be found at cartobq.docs.universe_matching_current_universe.
A total universe from Precisely, whose premium data can be found in the CARTO Data Observatory. The sample used in this example can be found at cartobq.docs.universe_matching_total_universe.
For this task, the Analytics Toolbox provides the UNIVERSE_MATCHING procedure, which performs a fuzzy match between the two datasets. Several aspects are taken into consideration for a match, aiming for the procedure to be as general and robust as possible:
The venues' spatial position is used as a first filter. By default, only the 60 closest neighbors to a current universe location are checked for matching, and there is a hard limit of 500 m on the distance between them. These values are sensible defaults but can be changed in the options argument should the user need it; please check the SQL reference for further details. The location of the candidates is used to compute the proximity score: a value in the [0, 1] interval that depends on the candidates' distance distribution.
The name of the locations is the second criterion used in the match. For each possible candidate, a text similarity score is computed, measuring how similar the two names are (also in the [0, 1] interval). The text comparison is robust to different capitalizations and word orders. Please refer to the SQL reference for more details on the method used.
These two scores are consolidated into a single similarity measure using a weighted average. The user can modify the weights to influence candidate selection, for instance to emphasize text similarity over proximity, or vice versa.
An arbitrary number of columns can be passed to the procedure, but at least three of them are required in both the current and total universe queries:
A unique ID column that will be used to match against the original query and the resulting pairs.
A location column that will be used to find candidates and compute the proximity scoring.
A name column that will be used to compute the text similarity.
With this in mind, we can run the procedure. It exposes several parameters; in this case, we will adjust the similarity weights to give more importance to the text similarity, since we are not confident in the location quality of our current universe.
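As an illustrative sketch, the call could look as follows. The procedure name comes from this guide, but the argument order and the option keys shown here are assumptions; please check the SQL reference for the exact signature.

```sql
-- Illustrative sketch: argument order and option names are assumptions;
-- consult the Analytics Toolbox SQL reference for the exact signature.
CALL `carto-un`.carto.UNIVERSE_MATCHING(
  -- Current universe query: must expose an ID, a location and a name column
  'SELECT id, geom, name FROM `cartobq.docs.universe_matching_current_universe`',
  -- Total universe query, with the same three required columns
  'SELECT id, geom, name FROM `cartobq.docs.universe_matching_total_universe`',
  -- Output table for the matched pairs
  '<my-project>.<my-dataset>.universe_matching_results',
  -- Options: weight text similarity over proximity, as discussed above
  '{"similarity_weights": {"text": 0.7, "proximity": 0.3}}'
);
```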
Running this procedure will store the results in the <my-project>.<my-dataset>.universe_matching_results table in BigQuery. Below is a sample query showing how to consume this table. Note that the resulting table is joined with the original tables to obtain the POI names for a richer comparison:
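A query of this kind could look as follows. The column names used here are assumptions inferred from the output shown below; adjust them to the actual schema of your results table.

```sql
-- Join the matching results back to both input tables so the POI names
-- appear next to the scores. Column names are illustrative assumptions.
SELECT
  r.id_current,
  r.id_total,
  r.proximity_similarity,
  r.text_similarity,
  r.similarity,
  c.name AS name_current,
  t.name AS name_total
FROM `<my-project>.<my-dataset>.universe_matching_results` r
JOIN `cartobq.docs.universe_matching_current_universe` c
  ON r.id_current = c.id
JOIN `cartobq.docs.universe_matching_total_universe` t
  ON r.id_total = t.id
ORDER BY r.similarity DESC;
```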
| Current ID | Total ID | Proximity score | Text similarity | Similarity | Current name | Total name |
|---|---|---|---|---|---|---|
| 975312181 | D000PIVHYQYW | 0.954844 | 0.958333 | 0.957810 | Mustafa Demir's Gemüsekebab | MUSTAFA DEMIR'S GEMÜSE KEBAP |
| 438098471 | D000PIT75EJ5 | 0.971799 | 0.933333 | 0.939103 | China-Restaurant Hua Ting | CHINA-RESTAURANT HUA-TING |
| 2796742907 | D000PIVGNVKY | 0.970360 | 0.923077 | 0.930169 | Kaffee Einstein | EINSTEIN KAFFEE |
| 1825682816 | D000PIZCQYOT | 0.983886 | 0.909091 | 0.920310 | Holiday Inn - Centre Alexanderplatz | HOLIDAY INN BERLIN-CENTRE ALEXANDERPLATZ |
| 4761628572 | D000PIVBOMWU | 0.982121 | 0.909091 | 0.920045 | Kantine Volksbühne | VOLKSBÜHNEN KANTINE |
| ... | ... | ... | ... | ... | ... | ... |
We can observe that the first row corresponds to a match with an overall similarity score of 0.957810: the two names show only small differences, yielding a text similarity score of 0.958333, and the physical distance between the venues corresponds to a proximity score of 0.954844.
Even though the previous table already provides a wealth of insight, we can also use the UNIVERSE_MATCHING_REPORT procedure to do some of the heavy lifting for us. This procedure performs some of the most common tasks in this analysis, namely:
Create a filtered table using a minimum similarity acceptable for each pair.
Create an expansion universe table, including all the rows in the total universe that were not matched above the minimum similarity threshold.
Create a report, including some useful metrics for this exercise.
To run this procedure, we just need to call:
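A sketch of the call is shown below. The exact signature is an assumption (the procedure name comes from this guide, but the argument order and the threshold parameter are illustrative); see the SQL reference for the authoritative definition.

```sql
-- Illustrative sketch: the exact signature is an assumption;
-- consult the Analytics Toolbox SQL reference before running.
CALL `carto-un`.carto.UNIVERSE_MATCHING_REPORT(
  '<my-project>.<my-dataset>.universe_matching_results',  -- matching results table
  '<my-project>.<my-dataset>.universe_matching',          -- output table prefix
  0.9                                                     -- minimum similarity threshold
);
```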
This procedure will create the three tables mentioned above under the prefix <my-project>.<my-dataset>.universe_matching. Let us check each of them individually.
The first one, which is returned by the procedure, is the report:
| current_universe | total_universe | matched_universe | expansion_universe | market_penetration |
|---|---|---|---|---|
| 514 | 2584 | 511 | 2080 | 0.197755 |
Here we can see the following metrics:
The current_universe is the total number of distributors in our current universe query.
The total_universe is the number of distributors in our total universe.
The matched_universe is the number of distributors successfully matched above the confidence threshold. Here we can see that three of our current distributors have not been matched above the similarity threshold.
The expansion_universe is the number of distributors in the total universe that were not matched above the similarity threshold.
The market_penetration is the ratio of the total universe that is currently covered by our current universe.
This digest is stored in the table <my-project>.<my-dataset>.universe_matching_report.
The second table created by this procedure is <my-project>.<my-dataset>.universe_matching_filtered: a filtered version of the input <my-project>.<my-dataset>.universe_matching_results that contains only the pairs matched above the minimum similarity threshold passed. That way, we ensure that these matches meet a minimum quality to be trusted. We can take a quick look at these matches in the following map.
The third and last table generated by this procedure is <my-project>.<my-dataset>.universe_matching_expansion_universe, which contains all the rows in the total universe query that were not matched above the similarity threshold passed to the procedure. As per the report, this table will contain 2080 rows. We can use it to identify the different places where our business could expand, consolidated in a single table.
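As a quick sanity check, we can peek at the expansion candidates directly (the table name follows the prefix convention described above; its exact columns depend on the total universe query):

```sql
-- Inspect a handful of expansion candidates, i.e. potential distributors
-- from the total universe that our current universe does not cover yet.
SELECT *
FROM `<my-project>.<my-dataset>.universe_matching_expansion_universe`
LIMIT 10;
```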
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 960401.