Dimensions
Dimensions Overview
Dimensions are non-aggregatable expressions that are used to define the level of aggregation that a MetricFlow user would like a metric to be aggregated to. You can also think of dimensions as the various data cuts (the slices and dices) that you would like to see a metric grouped by. Dimensions cannot be aggregated, so they are then considered to be a property of the primary or unique identifiers of the table.
Dimensions are defined within data sources along with measures and identifiers. You can think of dimensions as the columns in your data table or query that are non-aggregatable — they provide categorical or time-based context to enrich queries, and are usually included in the GROUP BY clause of your SQL query.
Consider the following data source as the basis for the examples to follow.
data_source:
name: transactions
description: A record for every transaction that takes place. Carts are considered multiple transactions for each SKU.
owners: support@transform.co
sql_table: schema.transactions
# --- IDENTIFIERS ---
identifiers:
- name: transaction_id
type: primary
- name: customer_id
type: foreign
- name: store_id
type: foreign
- name: product_id
type: foreign
# --- MEASURES ---
measures:
- name: revenue
description:
expr: price * quantity
agg: sum
# --- DIMENSIONS ---
dimensions:
- name: ds
type: time
expr: date_trunc('day', ts)
type_params:
is_primary: true
time_granularity: day
- name: is_bulk_transaction
type: categorical
expr: case when quantity > 10 then true else false end
All dimensions require a name
, type
and in most cases, an expr
parameter.
name
refers to the name for the dimension, which will be exposed to the end user. Thename
parameter can also act as an alias if the underlying column or SQL query reference is a different string within theexpr
parameter.- Note that dimension names must be unique within a data source, but they are not required to be unique across different data sources. MetricFlow will identify the correct dimension via the proper join identifiers. For further discussion on join paths, see the Join logic page.
type
refers to the type of dimension created within the data source. Current types include:- Categorical: Usually a way to group rows within a table, such as geography, product type, color, etc.
- Time: Time dimensions will point to an underlying date field in the warehouse. All time dimensions must be TIMESTAMP or the DB engine equivalent.
expr
(optional) refers to the underlying column or SQL query to define the dimension. If noexpr
parameter is present for a dimension, MetricFlow will look for a column that matches thename
parameter.- For valid inputs into the
expr
parameter: you may input the column name itself, or a valid SQL expression such as a case statement.
- For valid inputs into the
Dimensions Types
Categorical Dimensions
Category dimensions allow metrics to be grouped by different "category" slices and dices, such as product type, color, or geographical area. These may point to existing columns in the data table, or be calculated using a SQL expression with the expr
parameter.
As different business functional areas may have different ways to categorize the data, it can be very valuable to create new dimensions to capture those bespoke groupings.
In the example file provided below, is_bulk_transaction
is a categorical dimension where we have provided a case statement for further grouping based on the underlying column quantity
. is_bulk_transaction
will then be available to the end user to filter or group by in the UI.
Example:
dimensions:
- name: is_bulk_transaction
type: categorical
expr: case when quantity > 10 then true else false end
Time Dimensions
tip
Time dimension columns must be datetime data type if you are using BigQuery as your Data Warehouse. If they are stored as timestamp or another data type, you can cast the dimensions to datatime in the expr
property. i.e expr: cast(timestamp as datetime)
All time dimensions must be TIMESTAMP or the DB engine equivalent.
Time dimensions are used to aggregate metrics against different levels of time granularity. Current granularities supported by MetricFlow include day, week, month, quarter, and year (see the parameter time_granularity
for further details).
Time dimensions in MetricFlow have a few more parameters, which are specified under the type_params
section, as seen in the example below.
Example:
dimensions:
- name: created_at
type: time
expr: date_trunc('day', ts_created) #ts_created is the underlying column name from the table
is_partition: True
type_params:
is_primary: True
time_granularity: day
- name: deleted_at
type: time
expr: date_trunc('day', ts_deleted) #ts_deleted is the underlying column name from the table
is_partition: True
type_params:
is_primary: False
time_granularity: day
measures:
- name: users_deleted
expr: 1
agg: sum
agg_time_dimension: deleted_at
create_metric: True
- name: users_created
expr: 1
agg: sum
create_metric: True
Within type_params
:
is_primary
: Set theis_primary
parameter toTrue
to indicate if a time dimension is to be the primary, or preferred, time dimension for a measure or metric.Note that a primary time dimension is required for data sources that contain measures. The primary time dimension is used for graphing the x-axis within the Transform UI. For materializations, the primary time dimension is used as the common time dimension for all metrics. If you have other time dimensions in your data source, these time dimensions will require
is_primary: False
.To assign a non-primary time dimension to measures in MetricFlow, specify an
agg_time_dimension
parameter for the measure and refer to any time dimension that is defined in the dimension section.In the example above, we have a data source with two different time dimensions defined as
created_at
anddeleted_at
.- Within the data source, the time dimension
created_at
has been specified as the primary time dimension viais_primary: True
. - The time dimension
deleted_at
has been specified as the non-primary time dimension viais_primary: False
. - The measure
users_created
defaults to the primary time dimension (created_at
) - On the other hand, the measure
users_deleted
utilizes the time dimensiondeleted_at
as its time dimension via the parameteragg_time_dimension
.
- Within the data source, the time dimension
Below, the CLI Command example shows querying a single metric by its time dimension. For a single metric, the default time dimension is the specified primary time dimension, which may be referred to as
metric_time
or the time dimension's name itself.- MetricFlow
# Both CLI queries below will return `users_created` by the primary time dimension `created_at`)
mf query --metrics users_created --dimensions metric_time --order metric_time
mf query --metrics users_created --dimensions created_at --order created_at
# Specify a time dimension by name if you choose to query the ßmeasure with a time dimension that has not been specified as its agg_time_dimension
mf query --metrics users_deleted --dimensions created_at --order created_at- Transform
# Both CLI queries below will return `users_created` by the primary time dimension `created_at`)
mql query --metrics users_created --dimensions metric_time --order metric_time
mql query --metrics users_created --dimensions created_at --order created_at
# Specify a time dimension by name if you choose to query the measure with a time dimension that has not been specified as its agg_time_dimension
mql query --metrics users_deleted --dimensions created_at --order created_atBelow, the CLI Command example shows querying multiple metrics by differing time dimensions. The MetricFlow framework recognizes the multiple time dimensions that are referenced by two separate metrics:
users_created
utilizes the primary time dimensioncreated_at
whileusers_deleted
utilizes time dimensiondeleted_at
. The dimension argumentmetric_time
will identify the necessary primary time dimensions and agg_time_dimensions and join the metric values together based on the conjoined time spines.- MetricFlow:
mf query --metrics users_created,users_deleted --dimensions metric_time --order metric_time
- Transform:
mql query --metrics users_created,users_deleted --dimensions metric_time --order metric_time
time_granularity
: Time granularity refers to the minimum granularity that a measure or metric should be reported by. For example, a metric that is specified with weekly granularity would not be allowed to be shown with daily granularity.- Current options for time granularity are day, week, month, quarter, and year.
- Moreover, the specified time granularity allows for proper aggregation when querying metrics of different granularities. For example, when querying two metrics (one with a DAY granularity and another with a MONTH granularity), MetricFlow will only return a result with the MONTH granularity by default.
- Within the Transform UI, the specified time granularity also allows graphs to render correctly (e.g. a DAY granularity cannot be specified for a MONTHLY metric).
is_partition
: To assert that a dimension exists over a specific time window(e.g. a date-partitioned dimensional table), add theis_partition: True
parameter to the dimension. MetricFlow will add this date to any joins on this table to ensure that the correct dimensional values are joined to measures in other tables.
MetricFlow also allows for easy aggregation of metrics on the fly via the time-granularity option in the CLI. Suppose we have a measure named messages_per_month
where the original time_granularity
of the time dimension metric_time
was set to monthly as seen below:
Example:
data_source:
name: messages_month
description: Each row represents a message truncated to the month.
owners: support@transform.co
sql_table: schema.messages
# --- IDENTIFIERS ---
identifiers:
- name: message_id
type: primary
- name: user_id_from
type: foreign
- name: user_id_to
type: foreign
- name: message_user_from_to # Composite key
type: unique
identifiers:
- ref: message_id
- ref: user_id_from
- ref: user_id_to
# --- MEASURES ---
measures:
- name: messages_by_month
description: The total number of messages sent by month
expr: 1
agg: sum
create_metric: True
# --- DIMENSIONS ---
dimensions:
- name: ds
type: time
expr: date_trunc('month', ts) #Underlying column is named “ts”
type_params:
is_primary: true
time_granularity: month
The CLI command example below shows aggregating the messages_per_month
metric up to yearly via the time-granularity
option:
MetricFlow:
mf query --metrics messages_per_month --dimensions metric_time --order metric_time --time-granularity year
Transform:
mql query --metrics messages_per_month --dimensions metric_time --order metric_time --time-granularity year
Slowly Changing Dimensions
The MetricFlow framework allows for the ability to use joins against dimension values in a versioned data source built on top of an SCD (Slowly Changing Dimension) Type II warehouse table. This is particularly useful when you need a particular metric sliced by a dimension that changes over time, such as the historical trends of sales by a customer's country.
Currently, MetricFlow supports the basic structure of an SCD Type II warehouse table as such:
entity_key | dimension_1 | dimension_2 | ... | dimension_x | valid_from | valid_to |
---|
entity_key
(required): An entity_key (or some sort of identifier) must be presentvalid_from
(required): A timestamp indicating the start of a changing dimension value must be presentvalid_to
(required): A timestamp indicating the end of a changing dimension value must be present
Below is a sample SQL to demonstrate the join between a sample metric (num_events
, which serves as a count of events) and versioned dimension data (scd_dimensions
, which serves as our SCD Type II table) via a natural key (comprised of the entity_key/identifier and timestamp columns):
SELECT metric_time, dimension_1, SUM(1) AS num_events
FROM events a
LEFT OUTER JOIN scd_dimensions b
ON
a.entity_key = b.entity_key
AND a.metric_time >= b.valid_from
AND (a.metric_time < b. valid_to OR b.valid_to IS NULL)
GROUP BY 1, 2
caution
When a dimension value changes, the underlying SCD dimensions table is required to show valid_to
= valid_from
for the MetricFlow framework to properly work.
As a concrete example, note the example SCD table below. Sales_person_id 111 experiences a promotion from tier 1 to tier 2 on 2020-01-05, and similarly sales_person_id = 333 experiences a promotion from tier 2 to tier 3 on 2021-10-22. To designate the current dimension, you may indicate a date far into the future or input NULL as the ending date.
Example SCD configuration: Suppose we have the underlying SCD table:
sales_person_id | tier | start_date | end_date |
---|---|---|---|
111 | 1 | 2019-02-03 | 2020-01-05 |
111 | 2 | 2020-01-05 | 2048-01-01 |
222 | 2 | 2020-03-05 | 2048-01-01 |
333 | 2 | 2020-08-19 | 2021-10-22 |
333 | 3 | 2021-10-22 | 2048-01-01 |
Below is the sample data source configuration to create slowly changing dimensions.
- Note the extra arguments under
validity_params
: useis_start
to indicate the beginning timestamp column for a dimension value, andis_end
to indicate the ending timestamp column for a dimension value.
data_source:
name: sales_person_tiers
description: SCD Type II table of tiers for sales people
owners:
- support@transformdata.io
sql_table: schema.sales_tiers
dimensions:
- name: tier_start
type: time
expr: start_date
type_params:
time_granularity: day
validity_params:
is_start: True
- name: tier_end
type: time
expr: end_date
type_params:
time_granularity: day
validity_params:
is_end: True
- name: tier
type: categorical
identifiers:
- name: sales_person
type: natural
expr: sales_person_id
Below, we have a separate data source that holds a fact table for transactions
:
data_source:
name: transactions
description: |
Each row represents one transaction.
There is a transaction, product, sales_person, and customer id for
every transaction. There is only one transaction id per
transaction. The `ds` or date is reflected in UTC.
owners:
- support@transformdata.io
sql_table: demo_dbt.transactions_base
dbt_model: demo_dbt.transactions_base
identifiers:
- name: transaction_id
type: primary
- name: customer
type: foreign
expr: customer_id
- name: product
type: foreign
expr: product_id
- name: sales_person
type: foreign
expr: sales_person_id
- name: sales_person_customer
type: foreign
identifiers:
- ref: customer
- ref: sales_person
measures:
- name: transactions
expr: 1
agg: SUM
create_metric: True
- name: gross_sales
expr: sales_price
agg: SUM
create_metric: True
- name: sales_persons_with_a_sale
expr: sales_person_id
agg: COUNT_DISTINCT
create_metric: True
dimensions:
- name: ds
type: time
is_partition: true
type_params:
is_primary: True
time_format: YYYY-MM-DD
time_granularity: day
- name: sales_geo
type: categorical
You can now access the metrics in the data source transactions
split by the slowly changing dimension of tier
.
Example: Return the count of transactions generated by each sales tier per month
# MetricFlow
mf query --metrics transactions --dimensions metric_time__month,sales_person__tier --order metric_time__month --order sales_person__tier
# Transform
mql query --metrics transactions --dimensions metric_time__month,sales_person__tier --order metric_time__month --order sales_person__tier
caution
Currently, there are limitations with Transform's support of SCDs.
As their name suggests, slowly changing dimensions do change values within a coarser time granularity, thus naturally resulting in a range of valid rows with different dimension values for a given metric/measure. Within a coarser time window (i.e. month), the MetricFlow framework associates the metric with the first (minimum) available dimension value. In other words, the MetricFlow framework currently defaults to the dimension that is valid at the beginning of the time granularity. The team at Transform is aware of this limitation, and is including the advanced functionality of choosing the valid time dimension within a time window in the near future roadmap
Example 1: In the sales tier example mentioned above, suppose sales_person_id 123 is Tier 1 from 2022-03-01 to 2022-03-12, and gets promoted to Tier 2 from 2022-03-12 onwards. For the month of March, all transactions associated with sales_person_id 123 will be grouped under Tier 1 since the dimension value of Tier 1 comes earlier (and is the default starting dimension) than Tier 2.
Example 2: Using the same sales tier example, suppose sales_person_id 456 is Tier 2 from 2022-03-08 and onwards. The MetricFlow framework recognizes that sales_person_id 456 technically did not have an associated tier level dimension from 2022-03-01 to 2022-03-08. Therefore, for the month of March, all transactions associated with sales_person_id 456 will be grouped under 'NA' since no tier is present (2022-03-01 to 2022-03-08) prior to Tier 2 (2022-03-08 onwards).
The team at Transform recognizes these limitations on SCDs, and is working hard to allow more flexibility with SCDs, including but not limited to:
- Configurable joins (where one may configure how a metric is to be associated with a dimension within a coarser time window)
- Support measures in SCD data sources