Data Sources

Data Source YAMLs

For each data source, create a configuration YAML in your Transform repository.

A data source is usually a table within your data warehouse. However, you can create as many data sources from each table as you want.

See the example below for a full config, followed by a more detailed description of each field.

Example:

data_source: # This section includes metadata on the data source and the source definition.
  name: transactions # Define the name of the source. This name can be updated in the future.
  # Provide a detailed description of the data source here and include any important details. Other configuration contributors will primarily use this description.
  # It is only surfaced in the UI under the lineage view on the metric page.
  description: |
    This table captures every transaction starting July
    02, 2014. Each row represents one transaction. There
    will be a new row for any cancellations or alterations.
    There is a transaction, order, and user id for
    every transaction. There is only one transaction id per
    transaction, but there can be many rows per order id and
    user id. The `ds` or date is reflected in UTC.
  owners: # List the emails of the owners of this data source. This list is currently not used for notifications, but it will be in the future.
    - support@transformdata.io
  sql_table: demo_schema.transactions # The table from which the data source is constructed. You can also pass in the database name first.
  # sql_query: You can also pass in a SQL query here. This allows you to aggregate or filter the data before passing it into Transform.

  identifiers:
    # Define the primary, foreign, or unique key columns in your source that can be used to join to other data sources.
    # Each join key should have a type of either primary, foreign, or unique.
    # Additionally, keys can be referenced directly by column name from the data source, or they can be altered using an expression.
    - name: id_transaction
      type: primary
    - name: id_order
      type: foreign
    - name: id_user
      type: foreign
      expr: SUBSTRING(id_order FROM 2)

  measures:
    # Define the fields from your data source to be aggregated as inputs to metrics (e.g. in metrics.yaml).
    # Each measure has an aggregation and an optional description. A measure can simply reference a column or may be calculated using a SQL expression.
    # Measures default to an expression of their name. The available aggregations are sum, max, min, count_distinct, and sum_boolean.
    - name: transaction_amount_usd
      description: The total USD value of the transaction.
      agg: sum
    - name: transactions
      description: The total number of transactions.
      expr: "1"
      agg: sum
    - name: quick_buy_amount_usd
      description: The total USD value of the transactions that were purchased using the "quick buy" button.
      expr: CASE WHEN transaction_type_name = 'quick buy' THEN transaction_amount_usd ELSE 0 END
      agg: sum
    - name: quick_buy_transactions
      description: The total transactions bought as quick buy.
      expr: quick_buy_flag
      agg: sum_boolean

  dimensions:
    # Define the dimensions from your source. Dimensions are qualitative values such as names, dates, or geographical data.
    # Dimensions provide context to measures and are associated with metrics created from those measures to provide "metric by dimension" data slicing.
    # Dimensions can either directly reference a column or may be calculated using a SQL expression.
    - name: ds
      type: time
      type_params:
        is_primary: true # true or false
        time_granularity: day # day, week, month, quarter, or year
    - name: quick_buy_transaction
      type: categorical
      expr: |
        CASE
          WHEN transaction_type_name = 'quick_buy' THEN 'quick_buy'
          ELSE 'not_quick_buy'
        END

  mutability:
    # Define the appropriate mutability for this data source. Mutability refers to how the data underlying this configuration (a SQL table or the results of a SQL query) changes.
    # The available options are immutable, full_mutation, append_only, and ds_append_only; see the Mutability Configuration section below.
    type: append_only
    type_params:
      along: ts

Data Source

This section includes metadata on the data source and the source definition.

Name

Define the name of the source. This name can be updated in the future.
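
Example (taken from the full config above):

name: transactions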

Description

Provide a detailed description of the data source here and include any important details. Other configuration contributors will primarily use this description. It is only surfaced in the UI under the lineage view on the metric page.

Example:

description: |
  This table captures every transaction starting July
  02, 2014. Each row represents one transaction. There
  will be a new row for any cancellations or alterations.
  There is a transaction, order, and user id for
  every transaction. There is only one transaction id per
  transaction, but there can be many rows per order id and
  user id. The `ds` or date is reflected in UTC.

Owners

List the emails of the owners of this data source. This list is currently not used for notifications, but it will be in the future.

Example:

owners:
  - support@transformdata.io

sql_table or sql_query

There are two types of Data Sources: sql_table or sql_query.

Tip: A Table is preferred over a SQL Query, as it allows us to perform more efficient queries against source tables. Where possible, use a Table and move logic into expr fields instead.
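
For instance, a minimal sketch of this approach, reusing the quick buy measure from the example above: rather than filtering transactions in a sql_query, reference the table directly and express the logic in the measure's expr.

sql_table: demo_schema.transactions

measures:
  - name: quick_buy_amount_usd
    description: The total USD value of quick buy transactions.
    expr: CASE WHEN transaction_type_name = 'quick buy' THEN transaction_amount_usd ELSE 0 END
    agg: sum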

If using sql_table, simply add the schema and table in the warehouse.

Example:

sql_table: demo_schema.transactions

If you want to reference the database in data warehouses such as Snowflake, you can do so directly in the sql_table or sql_query:

sql_table: demo_db.demo_schema.transactions

If using a full query, write it in your data warehouse's query language.

Example:

sql_query: |
  SELECT
    t.id AS id_transaction
    , t.id_order
    , t.id_user
    , t.transaction_amount
    , t.transaction_amount_usd
    , tt.transaction_type_name
    , t.ds
  FROM demo_schema.transactions t
  JOIN demo_schema.transaction_type tt
    ON tt.id = t.id_transaction_type

dbt_model

If you use dbt for data transformations, you can use your dbt models as data sources. Follow the integration setup and you'll be able to include your dbt model in the dbt_model parameter. You must still specify a value in the sql_table or sql_query parameter.

Example:

sql_query: select * from demo.calls
dbt_model: demo_dbt.calls_base

Identifiers

Define the primary, foreign, or unique key columns in your source that can be used to join to other data sources. Each join key should have a type of either primary, foreign, or unique. Additionally, keys can be referenced directly by column name from the data source, or they can be altered using an expression.

Identifier Types:

Primary: A primary key has one and only one record for each row in the table, and it includes every record in the data warehouse.

Unique: A unique key has one and only one record for each row in the table, but it may cover only a subset of the records in the data warehouse. Unique keys can also contain nulls.

Foreign: A foreign key can have zero, one, or many instances of the same record. Foreign keys can also contain nulls.

Example:

identifiers:
  - name: transaction
    type: primary
    expr: id_transaction
  - name: order
    type: foreign
    expr: id_order
  - name: user
    type: foreign
    expr: SUBSTRING(id_order FROM 2)

Note about Dimensions and Identifiers

Dimensions are properties of primary or unique identifiers: dimensions cannot be aggregated, so they are treated as properties of the primary or unique identifier of the table. For example, a dimension such as is_bulk_transaction would be considered an attribute of the transaction identifier. That dimension is then usable in MetricFlow by any metric that can be aggregated to the transaction granularity.
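
A minimal sketch of this relationship, assuming a hypothetical is_bulk_transaction column on the transactions table:

data_source:
  name: transactions
  identifiers:
    - name: id_transaction
      type: primary
  dimensions:
    - name: is_bulk_transaction
      type: categorical

Because is_bulk_transaction is an attribute of id_transaction, any metric that can be aggregated to the transaction grain can be sliced by it.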

Composite Keys

data_source:
  name: users
  identifiers:
    - name: message
      expr: message_id
      type: foreign
    - name: user_message
      type: primary
      identifiers:
        - ref: message
        - name: user_id

To configure a composite key:

  • Define the fields that make up your key ahead of defining the key itself.
  • Name the composite key using the same name field as for any other identifier; in this example, user_message.
  • Specify the composite primary key via the identifier type primary.
  • List the fields that make up the key under identifiers.
  • A ref may be specified instead of a name to reference another identifier, in which case the name and expr are inherited from the referenced identifier.

Note: If keys that make up a composite identifier cannot stand alone or be joined to other data sources, as in this example, we recommend that you define those fields directly in the composite key definition, as we did with user_id. Notice in this example that we defined only one field before the composite key definition (message_id) and defined the other (user_id) directly in the composite key.

You can still choose to explicitly define all identifiers that make up your key ahead of defining the composite primary key itself. Our recommended approach is to consider whether each field can stand alone or whether it is only usable as part of the composite key.

Note: Transform will never implicitly create a composite key. If a data source happens to have team_id and user_id, we won't assume that data source can be joined with user_team unless explicitly defined.
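
For instance, a sketch of the explicit definition the note above calls for, mirroring the composite pattern shown earlier (the appropriate identifier type depends on the table's grain):

identifiers:
  - name: team_id
    type: foreign
  - name: user_id
    type: foreign
  - name: user_team
    type: primary
    identifiers:
      - ref: team_id
      - ref: user_id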

Measures

Define the fields from your data source to be aggregated as inputs to metrics (e.g. in metrics.yaml). Each measure has an aggregation and optional description. A measure can simply reference a column or may be calculated using a SQL expression. Measures have a default expression of their name. The available aggregations are sum, max, min, count_distinct, and sum_boolean.

Example:

measures:
  - name: transaction_amount_usd
    description: The total USD value of the transaction.
    agg: sum
  - name: transactions
    description: The total number of transactions.
    expr: "1"
    agg: sum
  - name: quick_buy_amount_usd
    description: The total USD value of the transactions that were purchased using the "quick buy" button.
    expr: CASE WHEN transaction_type_name = 'quick buy' THEN transaction_amount_usd ELSE 0 END
    agg: sum
  - name: quick_buy_transactions
    description: The total transactions bought as quick buy.
    expr: quick_buy_flag
    agg: sum_boolean

Dimensions

Define the dimensions from your source. Transform currently supports dimensions of type time and categorical. Dimensions are qualitative values such as names, dates, or geographical data. Dimensions provide context to measures and are associated with metrics created from those measures to provide “metric by dimension” data slicing. Dimensions can either directly reference a column or may be calculated using a SQL expression.

Example:

dimensions:
  - name: ds
    type: time
    type_params:
      is_primary: true # true or false
      time_granularity: day
  - name: quick_buy_transaction
    type: categorical
    expr: |
      CASE
        WHEN transaction_type_name = 'quick_buy' THEN 1
        ELSE 0
      END

Time dimensions

Type parameters are required for all time dimensions. In addition, a primary time dimension is required for any data source with a measure.

  • The primary time dimension is used for graphing the x-axis. For materializations, it's used as the common time dimension for all metrics.
  • Granularity currently supports day, week, month, quarter, year.

NOTE: The name of your primary time dimension must be the same across all your data sources.
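
Putting these requirements together, a sketch of a primary time dimension, reusing the ds dimension from the examples above (the same name would need to appear in every data source):

dimensions:
  - name: ds
    type: time
    type_params:
      is_primary: true
      time_granularity: day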

For data sources that use a SQL query, MetricFlow applies an optimization to avoid executing the query more often than necessary. A common example of the benefit: if multiple metrics need data from a single data source, the query is not run once per metric to retrieve that data. The process of producing this table is called priming, and it is especially beneficial when queries do full table scans or are otherwise expensive to run. In certain cases, some configuration in the data source is needed to tell MetricFlow how to prime in order to get the best performance.

Mutability Configuration

caution

An incorrect mutability setting will lead to inaccurate metric values. To avoid this, be sure to understand the available options and the update process for the underlying dataset.

Mutability refers to how the underlying data in this configuration (a SQL table or the results of a SQL query) changes.

The options are as follows:

  • immutable: The data source rarely changes, and MetricFlow only runs the query once.
  • full_mutation: The data changes frequently, and it is updated on a periodic basis.
  • append_only: The data source is updated only with new rows appended to it, and a mutability column indicates when there are new rows.
  • ds_append_only: Similar to append_only, except that the mutability column that indicates new rows is ds by default.
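
For example, the append_only configuration from the full config at the top of this page, where ts is the column that indicates when new rows arrive:

mutability:
  type: append_only
  type_params:
    along: ts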

For more information, please visit Mutability Settings.