Calculating market penetration in CPG with merchant universe matching

Merchant universe matching analysis in CPG consists of matching a company's current distributors (also referred as merchants or customers) to a more extensive set of potential distributors in order to understand their market penetration. A universe is a dataset of merchants that can be collected in-house or provided from an external source. In this analysis, we will be dealing with two of these datasets:

  • The current universe is the set of merchants that the company currently works with. It is usually an internal relation of known sales accounts.

  • The total universe is a larger set of potential merchants to be considered. Its source is usually a third party, whether a free source on the internet or a premium data provider.

The objective of this analysis is to generate a mapping from our current universe to the total universe and extract insights from it. This is usually no easy task since the different sources in these datasets mean they have no common index to join them.

In the Analytical Toolbox, different functions are provided to make this analysis easier for the user. This example will present a complete analysis using these functions and the steps involved in it.

Defining the current and total universe

For this example, we will work with a small beverage distributor established in Berlin. For that, we will be using the following tables:

  • A current universe table including different venues like restaurants, bars, hotels, etc., where the product is currently being sold. The names and locations of the venues have been extracted from OpenStreetMap. The table can be found at cartobq.docs.universe_matching_current_universe.

  • A total universe from Precisely, whose premium data can be found in the CARTO Data Observatory. The sample used in this example can be found at cartobq.docs.universe_matching_total_universe.

Matching the current and total universe

For this task, we provide the UNIVERSE_MATCHING procedure in the Analytics Toolbox, which performs a fuzzy match between the two datasets provided. There are several aspects taken into consideration for a match, aiming for the procedure to be as general and robust as possible:

  • The venues' spatial position is considered as a first filter. By default, only the 60 closest neighbors to a current universe location are checked for matching. There is also a hard limit on the distance between them of 500m. These values are provided as sensible defaults but can be changed in the options argument should the user need it; please check the SQL reference for further details. The location of the candidates is used to compute the proximity scoring: a value depending on the candidates' distance distribution that lies within the [0, 1] interval.

  • The name of the locations is the second criterion used in the match. For possible candidates, a text similarity scoring is computed, which measures how similar both names are (also in the [0, 1] interval). The text comparison tries to match strings taking into account different capitalizations and word order. Please refer to the SQL reference for more details on the method used.

These two similarities are consolidated into a single similarity measure, using a weighted average. The user can modify such weights to affect the candidate choosing - it may be needed to emphasize the text similarity over the proximity, or vice versa.

An arbitrary number of columns can be passed to the function, but it requires at least three of them in both the current and total universes, respectively:

  • A unique ID column that will be used to match against the original query and the resulting pairs.

  • A location column that will be used to find candidates and compute the proximity scoring.

  • A name column that will be used to compute the text similarity.

With this in mind, we can run the procedure. In the procedure, there are several parameters we can change; in this case, we will change the similarity weights, to give more importance to the text similarity since we are not confident about our current universe location quality.

CALL `carto-un`.carto.UNIVERSE_MATCHING(
  -- Current universe
  'cartobq.docs.universe_matching_current_universe',
  'id',
  'venue_name',
  'geom',
  -- Total universe
  'cartobq.docs.universe_matching_total_universe',
  'poiid',
  'name',
  'geom',
  -- Output table
  '<my-project>.<my-dataset>.universe_matching_results'
  -- Optional arguments
  '{"weights": {"text_similarity": 0.85, "proximity": 0.15}}',
);

Running this procedure will store the results in the <my-project>.<my-dataset>.universe_matching_results table in BigQuery. Below we can see a sample query on how to consume this table. Note that the resulting table is joined with the original tables to obtain the POI names for a richer comparison:

SELECT
  results.*,
  current_universe.venue_name AS current_universe_name,
  total_universe.name AS total_universe_name
FROM
  `<my-project>.<my-dataset>.universe_matching_results` results
  INNER JOIN cartobq.docs.universe_matching_current_universe current_universe
  ON results.current_universe_id = current_universe.id
  INNER JOIN cartobq.docs.universe_matching_total_universe total_universe
  ON results.total_universe_id = total_universe.poiid
ORDER BY
  results.similarity DESC

current_universe_id

total_universe_id

proximity

text_similarity

similarity

current_universe_name

total_universe_name

975312181

D000PIVHYQYW

0.954844

0.958333

0.957810

Mustafa Demir's Gemüsekebab

MUSTAFA DEMIR'S GEMÜSE KEBAP

438098471

D000PIT75EJ5

0.971799

0.933333

0.939103

China-Restaurant Hua Ting

CHINA-RESTAURANT HUA-TING

2796742907

D000PIVGNVKY

0.970360

0.923077

0.930169

Kaffee Einstein

EINSTEIN KAFFEE

1825682816

D000PIZCQYOT

0.983886

0.909091

0.920310

Holiday Inn - Centre Alexanderplatz

HOLIDAY INN BERLIN-CENTRE ALEXANDERPLATZ

4761628572

D000PIVBOMWU

0.982121

0.909091

0.920045

Kantine Volksbühne

VOLKSBÜHNEN KANTINE

...

...

...

...

...

...

...

We can observe that the first row corresponds to a match with a 0.957810 similarity score, where both names show some small differences and have a text similarity score of 0.958333, and they are at a physical distance corresponding to a proximity score of 0.954844.

Generating a universe-matching report

Even though the previous table already provides a vast wealth of insight, we can also use the UNIVERSE_MATCHING_REPORT procedure to do some more heavy lifting. This procedure will perform some of the most common tasks that we may be interested in during this analysis, namely:

  • Create a filtered table using a minimum similarity acceptable for each pair.

  • Create an expansion universe table, including all the rows in the expansion universe that are matched over the minimum similarity threshold.

  • Create a report, including some useful metrics for this exercise.

To call this procedure, we just need to call:

CALL `carto-un`.carto.UNIVERSE_MATCHING_REPORT(
  -- Total universe
  'cartobq.docs.universe_matching_total_universe',
  'poiid',
  -- Universe matching results
  '<my-project>.<my-dataset>.universe_matching_results',
  -- Output prefix
  '<my-project>.<my-dataset>.universe_matching'
  -- Optional arguments
  '{"min_similarity": 0.6}',
);

This procedure will create the three tables mentioned above under the prefix <my-project>.<my-dataset>.universe_matching. Let us check each of them individually.

The first one, which is returned by the procedure, is the report:

current_universe

total_universe

matched_universe

expansion_universe

market_penetration

514

2584

511

2080

0.197755

Here we can see the following metrics:

  • The current_universe is a count of the total number of distributors in our current universe query.

  • The total_universe is the number of distributors in our total universe.

  • The matched_universe is the number of successfully matched distributors over the confidence threshold. Here we can see how three of our current distributors have not been matched over the similarity threshold.

  • The expansion_universe is the number of distributors in the expansion universe that were not matched over the similarity threshold.

  • The market_penetration is the ratio of the total universe that is currently covered by our current universe.

This digest is stored in the table <my-project>.<my-dataset>.universe_matching_report

The second table created by this function is <my-project>.<my-dataset>.universe_matching_filtered; a filtered version of the input <my-project>.<my-dataset>.universe_matching_results that just contains correctly matched pairs above the minimum threshold passed. That way, we ensure that these matches are of a minimum quality to be trusted. We can take a quick look at these matches in the following map.

The third and last table generated by this function is <my-project>.<my-dataset>.universe_matching_expansion_universe, which contains all the rows in the total universe query that are not matched over the similarity threshold passed to the function. As per the report, this table will contain 2080 rows. We can use this table to understand different places our business can expand, centralized in a single table.

Last updated