Join Logic
Joins are a powerful part of MetricFlow and allow you to make all valid dimensions available for your metrics on the fly, even if they are defined in a different data source. Joins also allow you to construct metrics using measures from different data sources.
MetricFlow uses the identifiers
defined in your data source configs as the join keys between tables. Assuming the identifiers are set, MetricFlow can perform joins under the hood by creating a graph using the data sources as nodes and the join paths as edges. MetricFlow will generate the appropriate join type and avoid fan-out or chasm joins with other tables based on the identifier types.
tip
You do NOT have to create any new data sources or configuration files to utilize joins. MetricFlow will generate the appropriate joins for you under-the-hood. This doc serves to demonstrate the allowed join-types via identifiers and how to query joins via the CLI.
Types of Joins
MetricFlow implements most joins as left joins, and does not allow fan-out and chasm joins. Please reference the table below to identify which joins are/are not allowed based on specific identifier types in order to avoid constructing dangerous joins.
Identifier type - Table A | Identifier type - Table B | Join type |
---|---|---|
Primary | Primary | Left |
Primary | Unique | Left |
Primary | Foreign | Fanout (Not Allowed) |
Unique | Primary | Left |
Unique | Unique | Left |
Unique | Foreign | Fanout (Not Allowed) |
Foreign | Primary | Left |
Foreign | Unique | Left |
Foreign | Foreign | Fanout (Not Allowed) |
Example:
This example uses two data sources with a common identifier and shows a MetricFlow query that requires a join between the two data sources.
Say you have two data sources, transactions
and user_signup
as seen below:
data_source:
name: transactions
identifiers:
- name: id
type: primary
- name: user_id
type: foreign
measures:
- name: average_purchase_price
agg: avg
expr: purchase_price
create_metric: true
data_source:
name: user_signup
identifiers:
- name: user_id
type: primary
dimensions:
- name: type
type: categorical
Under the hood, MetricFlow will use user_id
as the join key between the data sources transactions
and user_signup
. As a result, we can now query for the metric average_purchase_price
(in the data source transactions
) to be sliced via the dimension type
from data source user_signup
.
Additionally, notice that the measure average_purchase_price
is defined in the data source transactions
(which has user_id
as a foreign identifier). Furthermore, the data source user_signup
has user_id
as a primary identifier. As a result, this is a foreign to primary relationship, and implemented as a left join (data source transactions
left joins data source user_signup
, as the measure average_purchase_price
is defined in the data source transactions
).
When querying dimensions from different data sources via the CLI, the convention is to prefix the dimension name with a dunder (double underscore) following the joining identifier in order to avoid ambiguous join paths. The CLI query below includes user_id__type
for the argument dimensions
; the query is specified to jump from data source transactions
(where the metric average_purchase_price
is defined) to the data source user_signup
via the identifier user_id
to obtain the dimension type
.
mql query --metrics average_purchase_price --dimensions metric_time,user_id__type
note
If you have multiple dimensions with the same name, it would be possible for an ambiguous join path to be created. To avoid this, use a dimension which includes the prefixed identifier that's meant to disambiguate where the dimension comes from. You can see the fully scoped dimension name by running mql list-metrics
in the CLI.
Multi-Hop Joins
Transform allows users to join measures and dimensions across a graph of identifiers. We call this a 'multi-hop join' as a user can 'hop' across one table to another.
Observe the schema below for the following example discussion:
Notice how this schema can be translated into the 3 MetricFlow data sources below to create the metric 'Average Purchase Price by Country' using the purchase_price
measure from the sales table and the country_name
dimension from the country_dim
table.
data_source:
name: sales
identifiers:
- name: id
type: primary
- name: user_id
type: foreign
measures:
- name: average_purchase_price
agg: avg
expr: purchase_price
create_metric: true
dimensions:
- name: ds
type: time
type_params:
is_primary: true
data_source:
name: user_signup
identifiers:
- name: user_id
type: primary
- name: country_id
type: Unique
dimensions:
- name: signup_date
type: time
type_params:
is_primary: true
data_source:
dame: country_dim
identifiers:
- name: country_id
type: primary
dimensions:
- name: country_name
type: categorical
Using Multi-Hop Joins
To query dimensions without a multi-hop join involved, you use the dimension name prefixed with a dunder (double underscore) and identifier. The same paradigm applies for dimensions retrieved by a multi-hop join, except you may have multiple sets of dundered identifiers, since you are hopping across multiple data sources.
As an example -- suppose you are interested in viewing the metric average_purchase_price
split by the dimension country_name
. Based on the schema specified in the data sources, MetricFlow knows to involve a multi-hop of (1) from data source sales
to data source user_signup
via identifier user_id
; (2) from data source user_signup
to data source country_dim
via identifier country_id
to obtain the desired final dimension of country_name
.
To query this multi-hop join, the dimension is specified with 2 sets of dunders as user_id__country_id__country_name
to reflect the 2 identifiers (user_id
and country_id
):
MetricFlow
mf query --metrics average_purchase_price dimensions --metric_time,user_id__country_id__country_name
Transform
mql query --metrics average_purchase_price dimensions --metric_time,user_id__country_id__country_name