Using Transform with Normalized vs Denormalized Data
We’ve been asked several times, “What is the best data warehouse design for Transform?” This article seeks to answer that question, and outlines what to keep in mind when using Transform with different types of warehouse designs.
Before diving in, it’s important to understand where Transform sits in your data stack. Transform is a metrics layer that sits on top of your data warehouse. You model entities via YAML configs, which MetricFlow builds into a data model. Transform uses this data model to generate SQL that is executed against your warehouse.
The key thing to keep in mind is that the MetricFlow data model will reflect the underlying structure of your warehouse, and will work with both normalized and denormalized data sets.
Depending on how your data is structured, there are some important considerations that will shape what your Transform configs look like. Below, we break down what to consider for a normalized warehouse design and for a denormalized warehouse design.
Here is a summary of some considerations to make when implementing Transform:
- Transform sits on top of your warehouse and mirrors your data warehouse design.
- Working with normalized data will yield the most flexible data model, allowing you to slice metrics by all available dimensions.
- Transform can work with denormalized data sets, but you may lose out on some flexibility since join paths are defined explicitly in the transformation layer.
- Building Data Sources from pre-aggregated summary tables yields the least flexible data model.
- Performance depends on the size of your datasets and query access patterns. There are caching and materialization strategies to improve performance on large data sets.
Using Transform with a Normalized Warehouse Design
Many data teams organize data in their warehouse following the principle of normalization, i.e., each piece of data is stored in one place rather than repeated across tables.
This typically looks like sets of fact tables and dimension tables. A common design pattern is a snowflake schema: a fact table in the middle, surrounded by dimension tables connected to it through foreign-to-primary-key relationships. We’ll use this type of schema design in our example.
You can think of Transform as a denormalization engine: once you model the normalized tables in your warehouse, Transform generates SQL that joins these tables together at query time, depending on the metric query it receives. The generated SQL produces a denormalized data set. One difference between Transform and other semantic models you may have used is that you do not explicitly specify join paths between tables. Instead, you define foreign and primary keys as identifiers, and MetricFlow infers all the possible join paths in the data model. Another term we use for this is “Data Mart Builder,” as we are essentially building out data marts at query time to serve different metric and dimension requests.
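As an illustration, a query for a purchase metric sliced by a user dimension could produce SQL shaped roughly like the following. The table and column names here are hypothetical, and the exact SQL MetricFlow emits will differ; this is only a sketch of the query-time denormalization:

```sql
-- Sketch of query-time denormalization (hypothetical names):
-- join the fact table to a dimension table on the shared identifier,
-- then aggregate the measure by the requested dimension.
SELECT
    users.country                                    -- dimension from the dimension table
  , SUM(purchases.amount) AS total_purchase_amount   -- aggregated measure from the fact table
FROM purchases
LEFT JOIN users
  ON purchases.user_id = users.user_id               -- foreign key -> primary key
GROUP BY users.country;
```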
These are the steps to implement Transform on top of a snowflake schema. First, model the underlying tables in YAML config files that Transform can understand. Each table becomes a Data Source object in Transform. For example, a Data Source representing a fact table would contain measures, a primary key, and foreign keys for all of the dimension tables that can join to it. You do not explicitly specify join paths; instead, you model foreign and primary keys within Data Sources, and the identifier types (primary, unique, and foreign) determine how tables join together.
```yaml
# Purchases Data Source (fact table)
# Note: exact field names may vary slightly by Transform/MetricFlow version.
data_source:
  name: purchases
  identifiers:
    - name: user_id        # foreign key to the users dimension table
      type: foreign
  measures:
    - name: amount         # column to aggregate when building metrics
      agg: sum
  dimensions:
    - name: purchase_date  # time dimension to aggregate measures by
      type: time
```
Since we’re modeling a fact table, the Data Source will mostly contain measures, which are columns you will aggregate over to create metrics. Although there typically aren’t dimensions in a fact table, you will need to define the time dimension that we will aggregate measures by.
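To make that concrete, here is a minimal sketch of a metric built from the amount measure. The metric name is our own, and the measure_proxy type (which exposes an aggregated measure directly as a metric) reflects the legacy Transform config format; check your version’s docs for the exact spelling:

```yaml
# Hypothetical metric built from the "amount" measure
metric:
  name: total_purchase_amount
  type: measure_proxy   # exposes the aggregated measure as a metric
  type_params:
    measure: amount
```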
Your dimension tables will also be Data Sources. They will likely have only one or two identifiers, since each typically joins to a single fact table. They will contain dimensions, but usually no measures.
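For example, a users dimension table might be modeled like this (the table and column names here are hypothetical):

```yaml
# Users Data Source (dimension table, hypothetical)
data_source:
  name: users
  identifiers:
    - name: user_id      # primary key; fact tables join to this via their foreign key
      type: primary
  dimensions:
    - name: country      # categorical dimension for slicing metrics
      type: categorical
    - name: signup_date
      type: time
```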
The final Transform data model object will look like this.
This design is super flexible, as you can ask for any metric by any number of dimensions. Working with a normalized database design will give you the most flexibility with your metrics. However, Transform still works well if you have denormalized tables in your warehouse.
Using Transform with a Denormalized Warehouse Design
You might not be working with normalized data in your warehouse. Many teams create denormalized, wide tables for reporting using database views, dbt, or another transformation tool. Join logic between tables is expressed explicitly and usually maintained in the transformation layer.
To model these tables in Transform, you would create a Data Source based on the denormalized table. This Data Source would include identifiers as well as the measures and dimensions defined in the table. It will likely be larger than a Data Source for a normalized table, since denormalized tables are usually fairly wide. Metrics can then be built from the measures defined in this table, and its dimensions are available for slicing and dicing within the same Data Source.
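A sketch of what such a Data Source might look like for a hypothetical wide orders table, where the dimensions were joined in upstream by the transformation layer:

```yaml
# Denormalized Data Source (hypothetical wide table)
data_source:
  name: orders_wide
  identifiers:
    - name: order_id
      type: primary
  measures:
    - name: order_total
      agg: sum
  dimensions:
    # Dimensions pre-joined into the table by dbt or views upstream
    - name: customer_country
      type: categorical
    - name: product_category
      type: categorical
    - name: ordered_at
      type: time
```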
One downside is that the joins are pre-defined, so you lose flexibility in your data model if there are dimensions you did not explicitly include in your denormalized table. A way around this is to model the joined tables as separate Data Sources and define additional dimensions there, so they can be joined in at query time, just as in the normalized design.
Another thing to consider is the grain of your denormalized table. You’ll have the least flexibility with pre-aggregated summary tables, as the grain will likely be a date (i.e., a daily engagement summary). You won’t be able to join in other Data Sources, because the identifiers are not included in the rollup table. You still get the benefit of governed metric definitions maintained in code, but you limit the set of questions your end users can answer.
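To illustrate that limitation, a hypothetical daily engagement summary would be modeled with only a time dimension and no foreign identifiers, so there is nothing for MetricFlow to join other Data Sources on:

```yaml
# Pre-aggregated summary Data Source (hypothetical)
data_source:
  name: daily_engagement_summary
  # No foreign identifiers: other Data Sources cannot be joined in
  measures:
    - name: active_users
      agg: sum
  dimensions:
    - name: ds            # the daily grain of the rollup
      type: time
```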
Transform should be performant with both denormalized and normalized warehouse designs. Performance depends more on the size of the underlying tables and the compute available in your warehouse. For large datasets, there are performance optimizations you can make with materializations and caching that work well across all types of database designs.