Skip to main content

Dimensions

Dimensions Overview

Dimensions are non-aggregatable expressions that are used to define the level of aggregation that a MetricFlow user would like a metric to be aggregated to. You can also think of dimensions as the various data cuts (the slices and dices) that you would like to see a metric grouped by. Dimensions cannot be aggregated, so they are then considered to be a property of the primary or unique identifiers of the table.

Dimensions are defined within data sources along with measures and identifiers. You can think of dimensions as the columns in your data table or query that are non-aggregatable — they provide categorical or time-based context to enrich queries, and are usually included in the GROUP BY clause of your SQL query.

Consider the following data source as the basis for the examples to follow.

data_source:
name: transactions
description: A record for every transaction that takes place. Carts are considered multiple transactions for each SKU.
owners: support@transform.co
sql_table: schema.transactions

# --- IDENTIFIERS ---
identifiers:
- name: transaction_id
type: primary
- name: customer_id
type: foreign
- name: store_id
type: foreign
- name: product_id
type: foreign

# --- MEASURES ---
measures:
- name: revenue
description:
expr: price * quantity
agg: sum

# --- DIMENSIONS ---
dimensions:
- name: ds
type: time
expr: date_trunc('day', ts)
type_params:
is_primary: true
time_granularity: day
- name: is_bulk_transaction
type: categorical
expr: case when quantity > 10 then true else false end

All dimensions require a name, type and in most cases, an expr parameter.

  • name refers to the name for the dimension, which will be exposed to the end user. The name parameter can also act as an alias if the underlying column or SQL query reference is a different string within the expr parameter.

    • Note that dimension names must be unique within a data source, but they are not required to be unique across different data sources. MetricFlow will identify the correct dimension via the proper join identifiers. For further discussion on join paths, see the Join logic page.
  • type refers to the type of dimension created within the data source. Current types include:

    • Categorical: Usually a way to group rows within a table, such as geography, product type, color, etc.
    • Time: Time dimensions will point to an underlying date field in the warehouse. All time dimensions must be TIMESTAMP or the DB engine equivalent.
  • expr (optional) refers to the underlying column or SQL query to define the dimension. If no expr parameter is present for a dimension, MetricFlow will look for a column that matches the name parameter.

    • For valid inputs into the expr parameter: you may input the column name itself, or a valid SQL expression such as a case statement.

Dimensions Types

Categorical Dimensions

Category dimensions allow metrics to be grouped by different "category" slices and dices, such as product type, color, or geographical area. These may point to existing columns in the data table, or be calculated using a SQL expression with the expr parameter.

As different business functional areas may have different ways to categorize the data, it can be very valuable to create new dimensions to capture those bespoke groupings.

In the example file provided below, is_bulk_transaction is a categorical dimension where we have provided a case statement for further grouping based on the underlying column quantity. is_bulk_transaction will then be available to the end user to filter or group by in the UI.

Example:

dimensions: 
- name: is_bulk_transaction
type: categorical
expr: case when quantity > 10 then true else false end

Time Dimensions

tip

Time dimension columns must be datetime data type if you are using BigQuery as your Data Warehouse. If they are stored as timestamp or another data type, you can cast the dimensions to datatime in the expr property. i.e expr: cast(timestamp as datetime) All time dimensions must be TIMESTAMP or the DB engine equivalent.

Time dimensions are used to aggregate metrics against different levels of time granularity. Current granularities supported by MetricFlow include day, week, month, quarter, and year (see the parameter time_granularity for further details).

Time dimensions in MetricFlow have a few more parameters, which are specified under the type_params section, as seen in the example below.

Example:

dimensions: 
- name: created_at
type: time
expr: date_trunc('day', ts_created) #ts_created is the underlying column name from the table
is_partition: True
type_params:
is_primary: True
time_granularity: day
- name: deleted_at
type: time
expr: date_trunc('day', ts_deleted) #ts_deleted is the underlying column name from the table
is_partition: True
type_params:
is_primary: False
time_granularity: day

measures:
- name: users_deleted
expr: 1
agg: sum
agg_time_dimension: deleted_at
create_metric: True
- name: users_created
expr: 1
agg: sum
create_metric: True

Within type_params:

  • is_primary: Set the is_primary parameter to True to indicate if a time dimension is to be the primary, or preferred, time dimension for a measure or metric.

    • Note that a primary time dimension is required for data sources that contain measures. The primary time dimension is used for graphing the x-axis within the Transform UI. For materializations, the primary time dimension is used as the common time dimension for all metrics. If you have other time dimensions in your data source, these time dimensions will require is_primary: False.

    • To assign a non-primary time dimension to measures in MetricFlow, specify an agg_time_dimension parameter for the measure and refer to any time dimension that is defined in the dimension section.

      • In the example above, we have a data source with two different time dimensions defined as created_at and deleted_at.

        • Within the data source, the time dimension created_at has been specified as the primary time dimension via is_primary: True.
        • The time dimension deleted_at has been specified as the non-primary time dimension via is_primary: False.
        • The measure users_created defaults to the primary time dimension (created_at)
        • On the other hand, the measure users_deleted utilizes the time dimension deleted_at as its time dimension via the parameter agg_time_dimension.
      • Below, the CLI Command example shows querying a single metric by its time dimension. For a single metric, the default time dimension is the specified primary time dimension, which may be referred to as metric_time or the time dimension's name itself.

        • MetricFlow
        # Both CLI queries below will return `users_created` by the primary time dimension `created_at`)
        mf query --metrics users_created --dimensions metric_time --order metric_time
        mf query --metrics users_created --dimensions created_at --order created_at

        # Specify a time dimension by name if you choose to query the ßmeasure with a time dimension that has not been specified as its agg_time_dimension
        mf query --metrics users_deleted --dimensions created_at --order created_at
        • Transform
        # Both CLI queries below will return `users_created` by the primary time dimension `created_at`)
        mql query --metrics users_created --dimensions metric_time --order metric_time
        mql query --metrics users_created --dimensions created_at --order created_at

        # Specify a time dimension by name if you choose to query the measure with a time dimension that has not been specified as its agg_time_dimension
        mql query --metrics users_deleted --dimensions created_at --order created_at

      • Below, the CLI Command example shows querying multiple metrics by differing time dimensions. The MetricFlow framework recognizes the multiple time dimensions that are referenced by two separate metrics: users_created utilizes the primary time dimension created_at while users_deleted utilizes time dimension deleted_at. The dimension argument metric_time will identify the necessary primary time dimensions and agg_time_dimensions and join the metric values together based on the conjoined time spines.

        • MetricFlow:
        mf query --metrics users_created,users_deleted --dimensions metric_time --order metric_time 
        • Transform:
        mql query --metrics users_created,users_deleted --dimensions metric_time --order metric_time 
  • time_granularity: Time granularity refers to the minimum granularity that a measure or metric should be reported by. For example, a metric that is specified with weekly granularity would not be allowed to be shown with daily granularity.

    • Current options for time granularity are day, week, month, quarter, and year.
    • Moreover, the specified time granularity allows for proper aggregation when querying metrics of different granularities. For example, when querying two metrics (one with a DAY granularity and another with a MONTH granularity), MetricFlow will only return a result with the MONTH granularity by default.
    • Within the Transform UI, the specified time granularity also allows graphs to render correctly (e.g. a DAY granularity cannot be specified for a MONTHLY metric).
  • is_partition: To assert that a dimension exists over a specific time window(e.g. a date-partitioned dimensional table), add the is_partition: True parameter to the dimension. MetricFlow will add this date to any joins on this table to ensure that the correct dimensional values are joined to measures in other tables.

MetricFlow also allows for easy aggregation of metrics on the fly via the time-granularity option in the CLI. Suppose we have a measure named messages_per_month where the original time_granularity of the time dimension metric_time was set to monthly as seen below:

Example:

data_source:
name: messages_month
description: Each row represents a message truncated to the month.
owners: support@transform.co
sql_table: schema.messages

# --- IDENTIFIERS ---
identifiers:
- name: message_id
type: primary
- name: user_id_from
type: foreign
- name: user_id_to
type: foreign
- name: message_user_from_to # Composite key
type: unique
identifiers:
- ref: message_id
- ref: user_id_from
- ref: user_id_to

# --- MEASURES ---
measures:
- name: messages_by_month
description: The total number of messages sent by month
expr: 1
agg: sum
create_metric: True

# --- DIMENSIONS ---
dimensions:
- name: ds
type: time
expr: date_trunc('month', ts) #Underlying column is named “ts”
type_params:
is_primary: true
time_granularity: month

The CLI command example below shows aggregating the messages_per_month metric up to yearly via the time-granularity option:

MetricFlow:

mf query --metrics messages_per_month --dimensions metric_time --order metric_time --time-granularity year  

Transform:

mql query --metrics messages_per_month --dimensions metric_time --order metric_time --time-granularity year  

Slowly Changing Dimensions

The MetricFlow framework allows for the ability to use joins against dimension values in a versioned data source built on top of an SCD (Slowly Changing Dimension) Type II warehouse table. This is particularly useful when you need a particular metric sliced by a dimension that changes over time, such as the historical trends of sales by a customer's country.

Currently, MetricFlow supports the basic structure of an SCD Type II warehouse table as such:

entity_keydimension_1dimension_2...dimension_xvalid_fromvalid_to
  • entity_key (required): An entity_key (or some sort of identifier) must be present
  • valid_from (required): A timestamp indicating the start of a changing dimension value must be present
  • valid_to (required): A timestamp indicating the end of a changing dimension value must be present

Below is a sample SQL to demonstrate the join between a sample metric (num_events, which serves as a count of events) and versioned dimension data (scd_dimensions, which serves as our SCD Type II table) via a natural key (comprised of the entity_key/identifier and timestamp columns):

SELECT metric_time, dimension_1, SUM(1) AS num_events
FROM events a
LEFT OUTER JOIN scd_dimensions b
ON
a.entity_key = b.entity_key
AND a.metric_time >= b.valid_from
AND (a.metric_time < b. valid_to OR b.valid_to IS NULL)
GROUP BY 1, 2
caution

When a dimension value changes, the underlying SCD dimensions table is required to show valid_to = valid_from for the MetricFlow framework to properly work.

As a concrete example, note the example SCD table below. Sales_person_id 111 experiences a promotion from tier 1 to tier 2 on 2020-01-05, and similarly sales_person_id = 333 experiences a promotion from tier 2 to tier 3 on 2021-10-22. To designate the current dimension, you may indicate a date far into the future or input NULL as the ending date.

Example SCD configuration: Suppose we have the underlying SCD table:

sales_person_idtierstart_dateend_date
11112019-02-032020-01-05
11122020-01-052048-01-01
22222020-03-052048-01-01
33322020-08-192021-10-22
33332021-10-222048-01-01

Below is the sample data source configuration to create slowly changing dimensions.

  • Note the extra arguments under validity_params: use is_start to indicate the beginning timestamp column for a dimension value, and is_end to indicate the ending timestamp column for a dimension value.
data_source:
name: sales_person_tiers
description: SCD Type II table of tiers for sales people
owners:
- support@transformdata.io

sql_table: schema.sales_tiers

dimensions:
- name: tier_start
type: time
expr: start_date
type_params:
time_granularity: day
validity_params:
is_start: True
- name: tier_end
type: time
expr: end_date
type_params:
time_granularity: day
validity_params:
is_end: True
- name: tier
type: categorical

identifiers:
- name: sales_person
type: natural
expr: sales_person_id

Below, we have a separate data source that holds a fact table for transactions:

data_source: 
name: transactions
description: |
Each row represents one transaction.
There is a transaction, product, sales_person, and customer id for
every transaction. There is only one transaction id per
transaction. The `ds` or date is reflected in UTC.

owners:
- support@transformdata.io

sql_table: demo_dbt.transactions_base
dbt_model: demo_dbt.transactions_base

identifiers:
- name: transaction_id
type: primary
- name: customer
type: foreign
expr: customer_id
- name: product
type: foreign
expr: product_id
- name: sales_person
type: foreign
expr: sales_person_id
- name: sales_person_customer
type: foreign
identifiers:
- ref: customer
- ref: sales_person

measures:
- name: transactions
expr: 1
agg: SUM
create_metric: True
- name: gross_sales
expr: sales_price
agg: SUM
create_metric: True
- name: sales_persons_with_a_sale
expr: sales_person_id
agg: COUNT_DISTINCT
create_metric: True

dimensions:
- name: ds
type: time
is_partition: true
type_params:
is_primary: True
time_format: YYYY-MM-DD
time_granularity: day
- name: sales_geo
type: categorical

You can now access the metrics in the data source transactions split by the slowly changing dimension of tier.

Example: Return the count of transactions generated by each sales tier per month

# MetricFlow 
mf query --metrics transactions --dimensions metric_time__month,sales_person__tier --order metric_time__month --order sales_person__tier

# Transform
mql query --metrics transactions --dimensions metric_time__month,sales_person__tier --order metric_time__month --order sales_person__tier
caution

Currently, there are limitations with Transform's support of SCDs.

As their name suggests, slowly changing dimensions do change values within a coarser time granularity, thus naturally resulting in a range of valid rows with different dimension values for a given metric/measure. Within a coarser time window (i.e. month), the MetricFlow framework associates the metric with the first (minimum) available dimension value. In other words, the MetricFlow framework currently defaults to the dimension that is valid at the beginning of the time granularity. The team at Transform is aware of this limitation, and is including the advanced functionality of choosing the valid time dimension within a time window in the near future roadmap

Example 1: In the sales tier example mentioned above, suppose sales_person_id 123 is Tier 1 from 2022-03-01 to 2022-03-12, and gets promoted to Tier 2 from 2022-03-12 onwards. For the month of March, all transactions associated with sales_person_id 123 will be grouped under Tier 1 since the dimension value of Tier 1 comes earlier (and is the default starting dimension) than Tier 2.

Example 2: Using the same sales tier example, suppose sales_person_id 456 is Tier 2 from 2022-03-08 and onwards. The MetricFlow framework recognizes that sales_person_id 456 technically did not have an associated tier level dimension from 2022-03-01 to 2022-03-08. Therefore, for the month of March, all transactions associated with sales_person_id 456 will be grouped under 'NA' since no tier is present (2022-03-01 to 2022-03-08) prior to Tier 2 (2022-03-08 onwards).

The team at Transform recognizes these limitations on SCDs, and is working hard to allow more flexibility with SCDs, including but not limited to:

  • Configurable joins (where one may configure how a metric is to be associated with a dimension within a coarser time window)
  • Support measures in SCD data sources