Introduction to MetricFlow Concepts

This is a short guide to MetricFlow concepts. It is written for new users and assumes that you have MetricFlow set up and deployed. Check out our deployment docs if you have not completed this step.

The objective of using MetricFlow is to define and manage the logic used to build a company's metrics. MetricFlow is an opinionated set of abstractions that enables the data consumer to retrieve metric datasets efficiently from a data warehouse. There are a few key principles:

  • Flexible, but complete - Define logic in abstractions that are flexible enough to construct any metric on any data model
  • Don't Repeat Yourself (DRY) - Whenever possible, enable metric logic to be defined once and reused, rather than duplicated across definitions
  • Simple with progressive complexity - Rely on known concepts and structures in data modeling to ensure that MetricFlow is approachable. Create escape hatches that enable folks to pursue more advanced and unsupported features.
  • Performant and Efficient - Enable the performance optimizations of centralized data engineering while supporting distributed definition and ownership of logic.

What is a metric in the context of Transform?

A metric is a quantitative indicator of the performance of a system such as a product or business.

Technically, a metric can be thought of as a function that takes data and dimensions as input and produces a set of numerical values as output. Those numerical values can be used in a variety of downstream applications. A metric is constructed in the context of a dimensional granularity. Dimensions can be of several different types with different properties, such as time or categorical. Similarly, identifiers can be used as a dimension (e.g., revenue by user), but they have the added benefit of serving as join keys that allow us to pull in context from other datasets.

Example Data Model

The example data schema below shows a number of different types of data sources:

  • transactions is a production DB export that has been cleaned up and organized for analytical consumption.
  • visits is a raw event log.
  • stores is a cleaned-up and fully normalized dimensional table from a daily production database export.
  • products is a dimensional table that came from an external source, such as a wholesale vendor of the goods this store sells.
  • customers is a partially denormalized table, in this case with a column derived from the transactions table through some upstream process.

[Figure: MetricFlow example schema]

To make this more concrete, let's consider the metric revenue, defined (without any dimensions) as the SQL expression select sum(price * quantity) as revenue from transactions.

Metric Queries

Dimensions, in this case, can come from the transactions table at a time granularity such as date_trunc(created_at, 'day'), or from other tables such as the products table, where we could get the product_category. We could calculate the metric aggregated to both dimensions in the following way:

select
  date_trunc(transactions.created_at, 'day') as day
  , products.category as product_category
  , sum(transactions.price * transactions.quantity) as revenue
from transactions
join products
  on transactions.product_id = products.product_id
group by 1, 2

The metric above is revenue but enriched with the context of a time and categorical dimension.

It's likely that the data models you have seen are more complicated than the one above. Building metrics this way across a variety of data models poses many challenges, and the abstractions in MetricFlow are chosen to support a wide variety of data models and types of data.

MetricFlow Overview

MetricFlow aims to capture metric logic such that it can construct the appropriate queries to build metrics at the many granularities that may be useful for various business applications.

There are two objects critical to the metric definition in MetricFlow: the Data Source and the Metric. Understanding these will allow you to configure metrics and dimensions in a way that minimizes redundancy while maximizing the potential of MetricFlow on top of your data.

Metrics are built on top of Data Sources and in many cases (but not always, see FAQ) are defined as a function of Measures. Together, Data Sources produce a data model that can be compiled and consumed through Transform's various APIs, which then construct SQL on top of the underlying data sources to resolve metrics at various granularities.

Data Sources (docs)

The input to MetricFlow is a data source. A data source is defined as either a table or a SQL query. When possible, a table is a more efficient input into MetricFlow because it puts more of the logic into MetricFlow's abstractions, which allows Transform to perform more optimizations on the backend.
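As an illustrative sketch (not part of the example schema above), a data source built from a query rather than a table might look like the following; the visits_cleaned name, the visited_at column, and the exact cleanup logic are assumptions made up for this example:

data_source:
  name: visits_cleaned
  description: Hypothetical query-based data source built on the raw visits event log.
  owners: support@transform.co
  # A SQL query is supplied in place of a table reference.
  sql_query: |
    select
      visit_id
      , customer_id
      , visited_at
    from schema.visits
    where visit_id is not null

  identifiers:
    - name: visit_id
      type: primary
    - name: customer_id
      type: foreign

  dimensions:
    - name: ds
      type: time
      expr: date_trunc('day', visited_at)
      type_params:
        is_primary: true
        time_granularity: day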

The columns in a data source can be used to construct three objects by whatever SQL logic a user wishes to express. The three objects are Identifiers, Measures, & Dimensions.

Identifiers: Act as the join keys between tables; using their types (primary, foreign, or unique), MetricFlow constructs the appropriate joins with other tables. They also have all the properties of a dimension, in that they can act as a granularity for aggregation.

Measures: Measures are aggregatable expressions built from the various columns of the underlying table or query. The definition of a measure includes an aggregation type that is then used to construct metrics.

Dimensions: Dimensions are non-aggregatable expressions used to define the granularity to which a MetricFlow user would like a metric aggregated.

As an example, the transactions table might be used as a data source in MetricFlow as follows:

data_source:
  name: transactions
  description: A record for every transaction that takes place. Carts are considered multiple transactions for each SKU.
  owners: support@transform.co
  sql_table: schema.transactions

  # --- IDENTIFIERS ---
  identifiers:
    - name: transaction_id
      type: primary
    - name: customer_id
      type: foreign
    - name: store_id
      type: foreign
    - name: product_id
      type: foreign

  # --- MEASURES ---
  measures:
    - name: revenue
      description:
      expr: price * quantity
      agg: sum
    - name: quantity
      description: Quantity of products sold
      expr: quantity
      agg: sum
    - name: active_customers
      description: A count of distinct customers completing transactions
      expr: customer_id
      agg: count_distinct

  # --- DIMENSIONS ---
  dimensions:
    - name: ds
      type: time
      expr: date_trunc('day', ts)
      type_params:
        is_primary: true
        time_granularity: day
    - name: is_bulk_transaction
      type: categorical
      expr: case when quantity > 10 then true else false end

Similarly, we could then create a products data source as follows:

data_source:
  name: products
  description: A record for every product available through our retail stores.
  owners: support@transform.co
  sql_table: schema.products

  # --- IDENTIFIERS ---
  identifiers:
    - name: product_id
      type: primary

  # --- DIMENSIONS ---
  dimensions:
    - name: category
      type: categorical
    - name: brand
      type: categorical
    - name: is_perishable
      type: categorical
      expr: |
        category in ("vegetables", "fruits", "dairy", "deli")

We can now take metrics constructed from measures in the transactions data source and aggregate them to the granularity of the dimensions in the products data source. As an example, we could use these two data sources to construct the query shown above, where we calculate revenue by product__category and day, as in the example request below.
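Using the request format described in the Accessing Metrics section below, that query could be expressed as a single CLI call; the comma-separated --dimensions list here is an illustrative assumption about the flag syntax:

mf query --metrics revenue --dimensions ds,product__category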

Key Concepts

A few key concepts to think about when modeling data:

Aggregation of Measures: MetricFlow can aggregate measures to the granularity of any of the identifiers in the table. If aggregated to the granularity of one of those identifiers, we can then pull in the various dimensions that apply to that identifier, adding context to the data.
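For example, aggregating the revenue measure to the granularity of the customer_id identifier might produce SQL roughly like the sketch below; it is illustrative only, not the exact query MetricFlow generates:

-- revenue aggregated to the customer_id granularity; dimensions from the
-- customers table could then be joined in on customer_id for added context
select
  transactions.customer_id
  , sum(transactions.price * transactions.quantity) as revenue
from transactions
group by 1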

Dimensions as properties of Primary Identifiers: Dimensions cannot be aggregated, so they are considered a property of the primary or unique identifiers of the table. In the table above, is_bulk_transaction is considered an attribute of a transaction_id. This dimension is then usable in MetricFlow by any metric that can be aggregated to the transaction granularity.

Joins are created by Identifier Types: When creating a data source, it's important to think about which other data sources will be able to use it to combine metrics and dimensions. There are three key identifier types in MetricFlow (primary, unique, and foreign), which yields nine potential join pairs. MetricFlow avoids fan-out and chasm joins by disallowing foreign-to-foreign, primary-to-foreign, and unique-to-foreign joins.

Metrics (docs)

A metric can be thought of as a function on top of the various data sources. In the example above, we created a revenue measure. The simplest type of metric is a measure_proxy, which means that the metric is just a measure. One way to define this metric is in the data source, as follows:

measures:
  - name: revenue
    description:
    expr: price * quantity
    agg: sum
    create_metric: true

Or the metric can be defined explicitly with a more complete set of configuration features:

metric:
  name: revenue
  description: Revenue is price x quantity for every product sold and recorded in the transactions table.
  type: measure_proxy
  type_params:
    measures: revenue

By defining the metric in this way, MetricFlow can then dynamically traverse the various data models, constructing SQL queries (like the one shown in the section above) to build metric datasets.

There are several metric types, and the list is constantly growing to support a wider variety of metrics. As an example of a more complicated metric, we could define a metric that combines measures and dimensions from several data sources as follows:

metric:
  name: perishables_revenue_per_active_customer
  description: Revenue from perishable goods (vegetables, fruits, dairy, deli) for each active customer.
  type: ratio
  type_params:
    numerator: revenue
    denominator: active_customers
  constraints: |
    product__category in ("vegetables", "fruits", "dairy", "deli")
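Conceptually, MetricFlow would resolve this metric (by day, for example) into SQL roughly like the sketch below; the exact generated query will differ, and this is only an illustration of the ratio and constraint logic:

-- illustrative sketch of perishables_revenue_per_active_customer by day
select
  date_trunc(transactions.created_at, 'day') as ds
  , sum(transactions.price * transactions.quantity)
      / count(distinct transactions.customer_id) as perishables_revenue_per_active_customer
from transactions
join products
  on transactions.product_id = products.product_id
where products.category in ('vegetables', 'fruits', 'dairy', 'deli')
group by 1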

Accessing Metrics

MetricFlow exposes a GraphQL API with a number of clients built on top of it, including:

  • Command Line Interface (CLI) to pull data locally and improve the dev workflow
  • SQL over JDBC to integrate with BI Tools and other SQL Clients
  • Python Library to pull metrics into Jupyter or other analytical interfaces
  • React Components to build embedded analytics
  • Airflow Operators to schedule API requests and pre-construction of metrics
  • GraphQL interface, which underlies all of these and is also exposed for end users to build their own interfaces

Each of the above APIs follows a common pattern of pulling metrics in the form of "metrics by dimensions." As a simple example, to pull revenue by day in the CLI, the user would write the following request:

With MetricFlow

mf query --metrics revenue --dimensions ds

With Transform

mql query --metrics revenue --dimensions ds

More complicated expressions are covered in the docs for each of these APIs.
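For instance, a request for multiple metrics at once might look like the following; the comma-separated --metrics list is an illustrative assumption, and exact flags can vary by version:

mf query --metrics revenue,quantity --dimensions ds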

Key Concepts

Metrics as functions - A key concept in MetricFlow is that metrics are functions that take in various parameters, such as measures, dimensions, and constraints, to define the logical operations MetricFlow performs.

Metrics in the context of dimensions - A metric is enriched in MetricFlow using dimensions. Without dimensions, a metric is simply a single number for all time, without the context necessary to be useful.

Metrics Interfaces - In order to be a single source of truth, MetricFlow must be able to take in any type of modeled data, construct any metric type, and serve that metric to the appropriate place. If it can't, you'd have to define your metrics elsewhere, and we would not be accomplishing our product philosophy. Please send us feature requests for new metric types, data modeling approaches, and APIs!

Further Reading

For further reading on the concepts of MetricFlow please see documentation on:

  • Priming: A guide to how MetricFlow takes a SQL Query data source and builds a candidate data source similar to a SQL Table data source.
  • Query Construction: A guide to how MetricFlow takes various candidate Data Sources and resolves an API request into a SQL Query.
  • Caching: A guide into the various caching mechanisms that allow MetricFlow to resolve queries more efficiently.

FAQ

  • Do my data sets need to be normalized?
    • Not at all! While a cleaned and well-modeled data set can be extraordinarily powerful and is the ideal input, you can use anything from raw to fully denormalized datasets.
    • It's recommended that you apply data consistency and quality transformations, such as filtering bad data, normalizing common objects, and data modeling of keys and tables, in upstream applications. Transform is most efficient at denormalizing data, rather than normalizing it.
    • If you have not invested in data consistency, that is okay. MetricFlow can take SQL queries or expressions to define consistent datasets.
  • Why is normalized data the ideal input?
    • MetricFlow is built to do denormalization efficiently. There are better tools to take raw datasets and accomplish the various tasks required to build data consistency and organized data models. On the other end, by putting in denormalized data you are potentially creating redundancy that is technically challenging to manage, and you are reducing the potential granularity that Transform can use to aggregate metrics.
  • Why not just make metrics the same as measures?
    • One principle of MetricFlow is to reduce the duplication of logic, sometimes referred to as Don't Repeat Yourself (DRY).
    • Many metrics are constructed from reused measures and in some cases constructed from measures from different data sources.
    • Additionally, not all metrics are constructed from measures. As an example, a conversion metric is likely defined as the presence or absence of an event record after some other event record.
  • How does MetricFlow handle joins?
    • MetricFlow builds joins based on the types of keys and parameters that are passed to identifiers. To better understand how joins are constructed, see our documentation on join types.
    • Rather than capturing arbitrary join logic, Transform captures the type of each identifier and then helps the user navigate to appropriate joins. This allows us to avoid the construction of fan-out and chasm joins, as well as generate legible SQL.
  • Are identifiers and join keys the same thing?
    • If it helps you to think of identifiers as join keys, that is very reasonable. Identifiers in MetricFlow have applications beyond joining two tables, such as acting as a dimension.
  • Can a table without a primary or unique identifier have dimensions?
    • Yes, but because dimensions are considered attributes of the primary or unique identifier of the table, they are only usable by the metrics defined in that table. They cannot be joined to metrics from other tables. This is common in event logs.