# Calculating market penetration in CPG with merchant universe matching

<div align="left"><figure><img src="https://3015558743-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FFEElAdsRIl9DzfMhbRlB%2Fuploads%2FUx7fNjcfw9KvGNf1JTaW%2Fadvanced%20banner.png?alt=media&#x26;token=ea2ec56e-2c6a-4c54-bae4-561b2fa33b7b" alt="Advanced difficulty banner" width="175"><figcaption></figcaption></figure></div>

Merchant universe matching analysis in CPG consists of matching a company's current distributors (also referred as merchants or customers) to a more extensive set of potential distributors in order to understand their market penetration. A *universe* is a dataset of merchants that can be collected in-house or provided from an external source. In this analysis, we will be dealing with two of these datasets:

* The **current universe** is the set of merchants that the company currently works with. It is usually an internal relation of known sales accounts.
* The **total universe** is a larger set of potential merchants to be considered. Its source is usually a third party, whether a free source on the internet or a premium data provider.

The objective of this analysis is to **generate a mapping from our current universe to the total universe** and extract insights from it. This is usually no easy task since the different sources in these datasets mean they have no common index to join them.

In the Analytical Toolbox, different functions are provided to make this analysis easier for the user. This example will present a complete analysis using these functions and the steps involved in it.

### Defining the current and total universe

For this example, we will work with a small beverage distributor established in Berlin. For that, we will be using the following tables:

* A current universe table including different venues like restaurants, bars, hotels, etc., where the product is currently being sold. The names and locations of the venues have been extracted from [OpenStreetMap](https://carto.com/spatial-data-catalog/browser/dataset/osm_nodes_1ca84971/). The table can be found at `cartobq.docs.universe_matching_current_universe`.
* A total universe from Precisely, whose premium data can be found in the [CARTO Data Observatory](https://carto.com/spatial-data-catalog/browser/dataset/psly_points_of_i_aae44fb4/). The sample used in this example can be found at `cartobq.docs.universe_matching_total_universe`.

{% embed url="<https://clausa.app.carto.com/map/cd438ab1-a5d6-412d-8453-5a5b38ba2816>" %}

### Matching the current and total universe

For this task, we provide the [`UNIVERSE_MATCHING`](https://academy.carto.com/advanced-spatial-analytics/spatial-analytics-for-bigquery/step-by-step-tutorials/broken-reference) procedure in the Analytics Toolbox, which performs a fuzzy match between the two datasets provided. There are several aspects taken into consideration for a match, aiming for the procedure to be as general and robust as possible:

* The **venues' spatial position** is considered as a first filter. By default, only the 60 closest neighbors to a current universe location are checked for matching. There is also a hard limit on the distance between them of 500m. These values are provided as sensible defaults but can be changed in the `options` argument should the user need it; please check the [SQL reference](https://docs.carto.com/data-and-analysis/analytics-toolbox-for-bigquery/sql-reference/cpg#universe_matching) for further details. The location of the candidates is used to compute the **proximity scoring**: a value depending on the candidates' distance distribution that lies within the *\[0, 1]* interval.
* The **name of the locations** is the second criterion used in the match. For possible candidates, a **text similarity** scoring is computed, which measures how similar both names are (also in the *\[0, 1]* interval). The text comparison tries to match strings taking into account different capitalizations and word order. Please refer to the [SQL reference](https://docs.carto.com/data-and-analysis/analytics-toolbox-for-bigquery/sql-reference/cpg#universe_matching) for more details on the method used.

These two similarities are consolidated into a single similarity measure, using a weighted average. The user can modify such weights to affect the candidate choosing - it may be needed to emphasize the text similarity over the proximity, or vice versa.

An arbitrary number of columns can be passed to the function, but it requires at least three of them in both the current and total universes, respectively:

* A unique ID column that will be used to match against the original query and the resulting pairs.
* A location column that will be used to find candidates and compute the proximity scoring.
* A name column that will be used to compute the text similarity.

With this in mind, we can run the procedure. In the procedure, there are several parameters we can change; in this case, we will change the similarity weights, to give more importance to the text similarity since we are not confident about our current universe location quality.

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.UNIVERSE_MATCHING(
  -- Current universe
  'cartobq.docs.universe_matching_current_universe',
  'id',
  'venue_name',
  'geom',
  -- Total universe
  'cartobq.docs.universe_matching_total_universe',
  'poiid',
  'name',
  'geom',
  -- Output table
  '<my-project>.<my-dataset>.universe_matching_results'
  -- Optional arguments
  '{"weights": {"text_similarity": 0.85, "proximity": 0.15}}',
);
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.UNIVERSE_MATCHING(
  -- Current universe
  'cartobq.docs.universe_matching_current_universe',
  'id',
  'venue_name',
  'geom',
  -- Total universe
  'cartobq.docs.universe_matching_total_universe',
  'poiid',
  'name',
  'geom',
  -- Output table
  '<my-project>.<my-dataset>.universe_matching_results'
  -- Optional arguments
  '{"weights": {"text_similarity": 0.85, "proximity": 0.15}}',
);
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.UNIVERSE_MATCHING(
  -- Current universe
  'cartobq.docs.universe_matching_current_universe',
  'id',
  'venue_name',
  'geom',
  -- Total universe
  'cartobq.docs.universe_matching_total_universe',
  'poiid',
  'name',
  'geom',
  -- Output table
  '<my-project>.<my-dataset>.universe_matching_results'
  -- Optional arguments
  '{"weights": {"text_similarity": 0.85, "proximity": 0.15}}',
);
```

{% endtab %}
{% endtabs %}

Running this procedure will store the results in the `<my-project>.<my-dataset>.universe_matching_results` table in BigQuery. Below we can see a sample query on how to consume this table. Note that the resulting table is joined with the original tables to obtain the POI names for a richer comparison:

```sql
SELECT
  results.*,
  current_universe.venue_name AS current_universe_name,
  total_universe.name AS total_universe_name
FROM
  `<my-project>.<my-dataset>.universe_matching_results` results
  INNER JOIN cartobq.docs.universe_matching_current_universe current_universe
  ON results.current_universe_id = current_universe.id
  INNER JOIN cartobq.docs.universe_matching_total_universe total_universe
  ON results.total_universe_id = total_universe.poiid
ORDER BY
  results.similarity DESC
```

| **`current_universe_id`** | **`total_universe_id`** | **`proximity`** | **`text_similarity`** | **`similarity`** | **`current_universe_name`**         | **`total_universe_name`**                |
| ------------------------- | ----------------------- | --------------- | --------------------- | ---------------- | ----------------------------------- | ---------------------------------------- |
| 975312181                 | D000PIVHYQYW            | 0.954844        | 0.958333              | 0.957810         | Mustafa Demir's Gemüsekebab         | MUSTAFA DEMIR'S GEMÜSE KEBAP             |
| 438098471                 | D000PIT75EJ5            | 0.971799        | 0.933333              | 0.939103         | China-Restaurant Hua Ting           | CHINA-RESTAURANT HUA-TING                |
| 2796742907                | D000PIVGNVKY            | 0.970360        | 0.923077              | 0.930169         | Kaffee Einstein                     | EINSTEIN KAFFEE                          |
| 1825682816                | D000PIZCQYOT            | 0.983886        | 0.909091              | 0.920310         | Holiday Inn - Centre Alexanderplatz | HOLIDAY INN BERLIN-CENTRE ALEXANDERPLATZ |
| 4761628572                | D000PIVBOMWU            | 0.982121        | 0.909091              | 0.920045         | Kantine Volksbühne                  | VOLKSBÜHNEN KANTINE                      |
| ...                       | ...                     | ...             | ...                   | ...              | ...                                 | ...                                      |

We can observe that the first row corresponds to a match with a 0.957810 similarity score, where both names show some small differences and have a text similarity score of 0.958333, and they are at a physical distance corresponding to a proximity score of 0.954844.

### Generating a universe-matching report

Even though the previous table already provides a vast wealth of insight, we can also use the [`UNIVERSE_MATCHING_REPORT`](https://academy.carto.com/advanced-spatial-analytics/spatial-analytics-for-bigquery/step-by-step-tutorials/broken-reference) procedure to do some more heavy lifting. This procedure will perform some of the most common tasks that we may be interested in during this analysis, namely:

* Create a filtered table using a minimum similarity acceptable for each pair.
* Create an expansion universe table, including all the rows in the expansion universe that are matched over the minimum similarity threshold.
* Create a report, including some useful metrics for this exercise.

To call this procedure, we just need to call:

{% tabs %}
{% tab title="carto-un" %}

```sql
CALL `carto-un`.carto.UNIVERSE_MATCHING_REPORT(
  -- Total universe
  'cartobq.docs.universe_matching_total_universe',
  'poiid',
  -- Universe matching results
  '<my-project>.<my-dataset>.universe_matching_results',
  -- Output prefix
  '<my-project>.<my-dataset>.universe_matching'
  -- Optional arguments
  '{"min_similarity": 0.6}',
);
```

{% endtab %}

{% tab title="carto-un-eu" %}

```sql
CALL `carto-un-eu`.carto.UNIVERSE_MATCHING_REPORT(
  -- Total universe
  'cartobq.docs.universe_matching_total_universe',
  'poiid',
  -- Universe matching results
  '<my-project>.<my-dataset>.universe_matching_results',
  -- Output prefix
  '<my-project>.<my-dataset>.universe_matching'
  -- Optional arguments
  '{"min_similarity": 0.6}',
);
```

{% endtab %}

{% tab title="manual" %}

```sql
CALL carto.UNIVERSE_MATCHING_REPORT(
  -- Total universe
  'cartobq.docs.universe_matching_total_universe',
  'poiid',
  -- Universe matching results
  '<my-project>.<my-dataset>.universe_matching_results',
  -- Output prefix
  '<my-project>.<my-dataset>.universe_matching'
  -- Optional arguments
  '{"min_similarity": 0.6}',
);
```

{% endtab %}
{% endtabs %}

This procedure will create the three tables mentioned above under the prefix `<my-project>.<my-dataset>.universe_matching`. Let us check each of them individually.

The first one, which is returned by the procedure, is the report:

| **`current_universe`** | **`total_universe`** | **`matched_universe`** | **`expansion_universe`** | **`market_penetration`** |
| ---------------------- | -------------------- | ---------------------- | ------------------------ | ------------------------ |
| 514                    | 2584                 | 511                    | 2080                     | 0.197755                 |

Here we can see the following metrics:

* The `current_universe` is a count of the total number of distributors in our current universe query.
* The `total_universe` is the number of distributors in our total universe.
* The `matched_universe` is the number of successfully matched distributors over the confidence threshold. Here we can see how three of our current distributors have not been matched over the similarity threshold.
* The `expansion_universe` is the number of distributors in the expansion universe that were not matched over the similarity threshold.
* The `market_penetration` is the ratio of the total universe that is currently covered by our current universe.

This digest is stored in the table `<my-project>.<my-dataset>.universe_matching_report`

The second table created by this function is `<my-project>.<my-dataset>.universe_matching_filtered`; a filtered version of the input `<my-project>.<my-dataset>.universe_matching_results` that just contains correctly matched pairs above the minimum threshold passed. That way, we ensure that these matches are of a minimum quality to be trusted. We can take a quick look at these matches in the following map.

{% embed url="<https://clausa.app.carto.com/map/63feda6f-6ac1-405f-832c-1336a41a81d0>" %}

The third and last table generated by this function is `<my-project>.<my-dataset>.universe_matching_expansion_universe`, which contains all the rows in the total universe query that are not matched over the similarity threshold passed to the function. As per the report, this table will contain 2080 rows. We can use this table to understand different places our business can expand, centralized in a single table.

{% embed url="<https://clausa.app.carto.com/map/f06da6b0-c707-421d-9f53-dbc1405034e9>" %}

<img src="https://content.gitbook.com/content/FEElAdsRIl9DzfMhbRlB/blobs/GWtJUWkKFcCzxcPMkgpe/eu-flag-website.png" alt="EU flag" data-size="line"> This project has received funding from the [European Union’s Horizon 2020](https://ec.europa.eu/programmes/horizon2020/en) research and innovation programme under grant agreement No 960401.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://academy.carto.com/advanced-spatial-analytics/spatial-analytics-for-bigquery/step-by-step-tutorials/calculating-market-penetration-in-cpg-with-merchant-universe-matching.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
