
Data Source Configuration Best Practices

Introduction

Transform takes the physical data layer of your data warehouse (rows and columns) and creates a semantic data layer on top of it. MetricFlow builds this semantic layer from YAML data source configuration files that describe how the data is stored in the warehouse. Below is a list of common questions and best practices for optimizing the performance of MetricFlow.

When should I use a SQL Table vs SQL Query?

SQL Table:

The table name in your data warehouse.

data_source:
  name: data_source_name
  description: My description
  owners:
    - owner_email1@company.com
  sql_table: schema.table

SQL Query:

A query that runs against your data warehouse.

data_source:
  name: data_source_name
  description: My description
  owners:
    - owner_email1@company.com
  sql_query: SELECT * FROM schema.table

Under the hood:

A SQL Table is always preferred to a SQL Query because it pushes more of the logic construction into MetricFlow and enables more optimization. A SQL Query is materialized to a table through a feature called "priming", and that table is then treated as a SQL Table by MetricFlow.

Best Practices:

  • Whenever possible, use sql_table. This will be substantially more performant.
  • Whenever possible, avoid joins in a sql_query. Instead, create a new data source for each table; Transform's semantic layer will automate the joins for you (see the sketch after this list).
  • If a sql_query is used, consider the most efficient mutability setting. By default, the primed table is considered invalid on an hourly basis, which makes an efficient mutability setting even more important.
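
As a sketch of the join guidance above, instead of writing a sql_query that joins orders to customers, each table can be registered as its own data source and linked through identifiers so MetricFlow can construct the join itself. This is a minimal sketch: the table, column, and identifier names are hypothetical, and in practice each data source would live in its own config file.

# orders.yaml (hypothetical)
data_source:
  name: orders
  description: One row per order
  owners:
    - owner_email1@company.com
  sql_table: schema.orders
  identifiers:
    # Foreign key MetricFlow can use to join to the customers data source
    - name: customer
      type: foreign
      expr: customer_id

# customers.yaml (hypothetical)
data_source:
  name: customers
  description: One row per customer
  owners:
    - owner_email1@company.com
  sql_table: schema.customers
  identifiers:
    # Primary key that other data sources join to
    - name: customer
      type: primary
      expr: customer_id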

What is an appropriate mutability setting for my data_source?

Under the hood:

The mutability setting determines when the cache considers a dataset invalid and needs to go back to the underlying dataset to pull fresh data.

Full Mutation

By default, MetricFlow assumes all data constructed from a full_mutation data source becomes invalid every hour. If you know when your datasets are typically updated, say by 12 PM UTC each day, you should set an update_cron for 1 PM UTC.

mutability:
  type: full_mutation
  type_params:
    # 6:00 Pacific Time or 9:00 Eastern Time is 13:00 UTC
    update_cron: 0 13 * * *

Immutable

If you decide to leverage the invalidate command with Airflow, set your mutability to immutable to prevent the cache from invalidating twice.

mutability:
  type: immutable

Best Practices

  • If you use full_mutation, specify an update_cron time to schedule the priming. If you do not, a priming query will run every hour, which can be expensive and may degrade the experience of end users consuming data through the APIs. A combined example follows this list.
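
Putting the pieces together, here is a minimal sketch of a sql_query data source with a scheduled priming time, combining the sql_query and full_mutation examples shown earlier. It assumes the mutability block is specified on the data source itself; the query and schedule are placeholders.

data_source:
  name: data_source_name
  description: My description
  owners:
    - owner_email1@company.com
  sql_query: SELECT * FROM schema.table
  mutability:
    type: full_mutation
    type_params:
      # Prime once per day at 13:00 UTC instead of relying on the hourly default
      update_cron: 0 13 * * *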

When do I use denormalized vs normalized tables?

Best Practices

  • Either works! But you will likely get more from MetricFlow by using normalized tables as data sources. A normalized table maximizes your ability to produce metrics at a variety of granularities consistently (see the sketch after this list).
  • A denormalized table is already aggregated and therefore cannot produce metrics at nearly as many granularities.
  • On the other hand, a raw table may not be consistent with other datasets in your warehouse.
  • Transform's semantic layer is capable of most denormalization tasks (aggregation, filtering, joining), and we're constantly seeking to fill any gaps.
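
To illustrate the normalized-table guidance above, a row-level fact table can expose a single measure that MetricFlow aggregates to whatever granularity a query requests (by day, by week, by customer, by country, and so on), whereas a pre-aggregated table fixes the grain up front. This is a minimal sketch: the table, column, measure, and dimension names are hypothetical.

data_source:
  name: transactions
  description: One row per transaction (hypothetical normalized table)
  owners:
    - owner_email1@company.com
  sql_table: schema.fct_transactions
  identifiers:
    - name: customer
      type: foreign
      expr: customer_id
  measures:
    # A single row-level measure; MetricFlow aggregates it to the requested granularity
    - name: transaction_amount
      agg: sum
  dimensions:
    - name: ds
      type: time
      type_params:
        is_primary: true
        time_granularity: day
    - name: country
      type: categorical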

Example

For an example data source and metrics written against a data model using these best practices, see the Introduction to MetricFlow.