
Data Source Inference

caution

This feature is still in Beta 🧪. As such, you should not expect it to be 100% stable or bug-free. Any public CLI or Python interfaces may change without prior notice.

If you find any bugs or feel like something is not behaving as it should, feel free to open an issue on the MetricFlow GitHub repo.

When first configuring MetricFlow data sources, manually adding entries for each table and column in your warehouse can get tedious and repetitive, especially if you have many of them. Data source inference helps you write these configurations by querying metadata and statistics about each table and column to infer (hence the name) the correct MetricFlow configs.

important

Inference output might not be 100% correct!

Despite having reasonable defaults, the inference pipeline might not always produce correct results and may, for example, consider a column a dimension when it is really an ID.

You should always review inferred outputs before pushing them to production MetricFlow deployments.

At the end of the day, you're the one who knows the most about your own warehouse and business!

Inference pipeline steps

Data source inference's main goal is to produce working configuration files from information that already exists in your warehouse, with minimal manual effort. This is achieved in 4 steps:

  1. Fetching context: reach out to the data warehouse to get contextual information about each column in the given set of tables.
  2. Running inference rules: run a set of rules over the fetched context to produce signals about each column.
  3. Solving types: given the produced signals, make a final decision about what each column is, e.g., an identifier, a categorical dimension, a field used in a measure.
  4. Rendering results: render the results to config files and possibly the console, in a human-readable way.

All of these steps are glued together with an InferenceRunner class that acts as a "main" function.

Let's take a more detailed look at how each of these steps works.

Fetching context

An inference pipeline gets context about tables and columns through InferenceContext subclasses. They are constructed and returned by InferenceContextProviders.

Currently, the only available context type is the DataWarehouseInferenceContext. It holds information about tables and their columns, such as the column type (integer, double, boolean, etc.), nullability, and statistics like min/max, distinct count and null count.

Due to implementation details, Snowflake is the only supported data warehouse as of now. We intend to add support for all the other warehouses in the near future.
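
As a rough illustration of the kind of metadata this context carries, the sketch below models the per-column facts described above. The class and field names are illustrative only and do not mirror MetricFlow's internal types.

from dataclasses import dataclass
from typing import Optional

# Illustrative only: a simplified model of the per-column facts that a
# DataWarehouseInferenceContext exposes to inference rules. Names here are
# hypothetical and do not match MetricFlow's internal classes.
@dataclass
class ColumnFacts:
    table: str                       # e.g. "YOUR_DB.SCHEMA.ORDERS"
    column: str                      # e.g. "order_id"
    sql_type: str                    # INTEGER, DOUBLE, BOOLEAN, TIMESTAMP, ...
    is_nullable: bool                # whether the column allows NULLs
    row_count: int                   # rows seen in the (possibly sampled) table
    distinct_count: int              # number of distinct values
    null_count: int                  # number of NULL values
    min_value: Optional[str] = None  # minimum value, if applicable
    max_value: Optional[str] = None  # maximum value, if applicable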

Running inference rules

Inference rules are special classes that process input contexts to produce InferenceSignals, pieces of evidence about a column's type. Each signal carries a confidence level and a reason, and also includes clues about how and when it should be weighed against other signals during the solving phase.
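
To make this concrete, here is a schematic sketch (not MetricFlow's actual API) of what a single rule does: inspect a column and, when a pattern matches, emit a signal carrying a candidate type, a confidence level and a human-readable reason.

from typing import Optional

# Schematic only: the function and the dictionary shape are hypothetical and
# do not reflect MetricFlow's real InferenceRule/InferenceSignal interfaces.
def identifier_by_name(column_name: str) -> Optional[dict]:
    """Emit an identifier signal for columns whose name ends with 'id'."""
    if column_name.lower().endswith("id"):
        return {
            "candidate_type": "IDENTIFIER",
            "confidence": "HIGH",  # one of LOW, MEDIUM, HIGH, VERY_HIGH
            "reason": f"column name '{column_name}' ends with 'id'",
        }
    return None  # this rule has nothing to say about the column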

By default, MetricFlow exports the following rules:

  • AnyIdentifierByNameRule: Finds columns ending with _?id and signals that they are identifiers.
  • PrimaryIdentifierByNameRule: Finds columns named id or table_?id and signals that they are primary identifiers.
  • UniqueIdentifierByDistinctCountRule: Finds columns with unique values and signals that they might be unique identifiers if another rule classifies them as an identifier.
  • ForeignIdentifierByCardinalityRatioRule: Finds columns with a low cardinality to row count ratio and signals that they might be foreign identifiers if another rule classifies them as an identifier. The default ratio threshold is 0.4 (see the sketch after this list for tuning it).
  • TimeDimensionByTimeTypeRule: Finds columns with a time type (DATE, DATETIME, TIME, TIMESTAMP) and signals that they are time dimensions.
  • PrimaryTimeDimensionByNameRule: Finds columns named ds, created_at, created_date or created_time and signals that they are primary time dimensions.
  • PrimaryTimeDimensionIfOnlyTimeRule: Finds columns with a time type (DATE, DATETIME, TIME, TIMESTAMP) that are the only such columns in their table and signals them as primary time dimensions.
  • CategoricalDimensionByBooleanTypeRule: Finds columns with a boolean type and signals that they are categorical dimensions.
  • CategoricalDimensionByStringTypeRule: Finds columns with a string type (CHAR, VARCHAR, TEXT) and signals that they are categorical dimensions if another rule classifies them as a dimension.
  • CategoricalDimensionByIntegerTypeRule: Finds columns with an integer type and signals that they are categorical dimensions if another rule classifies them as a dimension.
  • CategoricalDimensionByCardinalityRatioRule: Finds columns with a low cardinality to row count ratio and signals that they might be categorical dimensions if another rule classifies them as a dimension. The default ratio threshold is 0.2.
  • MeasureByRealTypeRule: Finds columns with a real type (FLOAT, DOUBLE) and signals that they are measures.
  • MeasureByIntegerTypeRule: Finds columns with an integer type and signals that they are measures if another rule classifies them as a measure.
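
A common reason to drop down to Python (covered later in this page) is to tune the two cardinality-ratio thresholds above for your data. The sketch below assumes the rule classes can be imported from metricflow.inference.rule.defaults (where DEFAULT_RULESET lives) and that the ratio rules take their threshold as a constructor argument; both are assumptions, so verify the actual signatures in that module before relying on this.

# Hedged sketch: import path and constructor arguments are assumptions and
# should be checked against metricflow.inference.rule.defaults.
from metricflow.inference.rule.defaults import (
    AnyIdentifierByNameRule,
    PrimaryIdentifierByNameRule,
    UniqueIdentifierByDistinctCountRule,
    ForeignIdentifierByCardinalityRatioRule,
    CategoricalDimensionByCardinalityRatioRule,
    MeasureByRealTypeRule,
)

# A partial custom ruleset that loosens the foreign-identifier ratio (default 0.4)
# and tightens the categorical-dimension ratio (default 0.2). Add whichever other
# default rules you want to keep.
CUSTOM_RULESET = [
    AnyIdentifierByNameRule(),
    PrimaryIdentifierByNameRule(),
    UniqueIdentifierByDistinctCountRule(),
    ForeignIdentifierByCardinalityRatioRule(0.6),     # assumed positional threshold
    CategoricalDimensionByCardinalityRatioRule(0.1),  # assumed positional threshold
    MeasureByRealTypeRule(),
]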

Solving column types

An InferenceSolver is a class that implements a policy for making the final decision about what a column is or isn't, based on all the signals that target it. It must reason through confidence values, types and hierarchy information to decide which signals are relevant, or to conclude that the evidence is too ambiguous to classify the column.

MetricFlow provides the WeightedTypeTreeInferenceSolver as the default solver. You shouldn't need to implement a custom solver unless you have a very specific case at hand. If inference is not working properly for you, the cause is more likely misconfigured rules producing ambiguous output than the solver itself.

Learn more about how the default solver works and how to set up custom rule sets.
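
Conceptually, the default solver behaves like a weighted vote: each signal contributes a numeric weight derived from its confidence, weights are accumulated per candidate type, and a type is only accepted if it holds a large enough share of the total weight. The sketch below is a simplification of that idea using the confidence-to-weight mapping and threshold the CLI exposes via --solver-weights and --solver-threshold; it is not the actual WeightedTypeTreeInferenceSolver, which also reasons about the type hierarchy.

from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Example weights for LOW, MEDIUM, HIGH and VERY_HIGH confidences,
# matching the CLI example `--solver-weights 1,3,5,7` further down.
CONFIDENCE_WEIGHTS = {"LOW": 1, "MEDIUM": 3, "HIGH": 5, "VERY_HIGH": 7}

def solve_column(signals: List[Tuple[str, str]], threshold: float = 0.65) -> Optional[str]:
    """Pick a type for one column from (candidate_type, confidence) pairs."""
    totals: Dict[str, float] = defaultdict(float)
    for candidate_type, confidence in signals:
        totals[candidate_type] += CONFIDENCE_WEIGHTS[confidence]
    total_weight = sum(totals.values())
    if total_weight == 0:
        return None  # no evidence at all
    best_type, best_weight = max(totals.items(), key=lambda item: item[1])
    # Accept the winner only if it dominates the evidence; otherwise stay
    # undecided and let the renderer flag the column for manual review.
    return best_type if best_weight / total_weight >= threshold else None

With these example weights, a column that received two HIGH "identifier" signals and one LOW "dimension" signal would resolve to an identifier (10 out of 11 total weight, well above the 0.65 threshold).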

Rendering results

An InferenceRenderer is a class that writes the resolved results about each table/column to some sort of useful output.

MetricFlow exports two default renderer types:

  • StreamInferenceRenderer: Writes the results to an output stream (file or console) as human-readable text. This gives a detailed, easy-to-parse view of why certain results were or weren't produced and of any problems found during solving, which can come in handy when tuning a rule set for your warehouse.
  • ConfigFileInferenceRenderer: Writes inference outputs as valid MetricFlow configuration files. Any uncertain or missed results will generate FIXME comments that you can search for to quickly find and fix issues with the generated configs.
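
If the built-in renderers don't cover your workflow, you can also post-process their output yourself, for example to upload the generated configs to cloud storage (a use case mentioned later in this page). A minimal sketch, assuming the config files were already written by the config file renderer as YAML and that boto3 is installed with AWS credentials configured:

import pathlib

import boto3  # assumption: boto3 is installed and AWS credentials are set up

def upload_inferred_configs(config_dir: str, bucket: str, prefix: str = "metricflow-configs") -> None:
    """Upload inferred config files to S3 after running inference."""
    s3 = boto3.client("s3")
    # The ".yaml" extension is an assumption; adjust to match the files you see on disk.
    for path in pathlib.Path(config_dir).glob("*.yaml"):
        s3.upload_file(str(path), bucket, f"{prefix}/{path.name}")

# Hypothetical usage after the inference run finishes:
# upload_inferred_configs("/path/to/inferred/configs", bucket="my-config-bucket")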

Running data source inference

Make sure you have installed and configured MetricFlow by running mf setup first.

There are two ways of running data source inference: through the MetricFlow CLI or through custom Python scripts.

From the CLI

To run data source inference from the CLI, use mf infer:

# - use all tables in schema "YOUR_DB.SCHEMA"
# - write output files in directory "/path/to/inferred/configs"
mf infer --schema YOUR_DB.SCHEMA --output-dir /path/to/inferred/configs

# - use tables "YOUR_DB.SCHEMA_1.TABLE_A" and "YOUR_DB.SCHEMA_2.TABLE_B"
# - sample at most 5000 rows from each table
# - write output files in directory "/path/to/inferred/configs"
mf infer --tables YOUR_DB.SCHEMA_1.TABLE_A,YOUR_DB.SCHEMA_2.TABLE_B --max-sample-size 5000 --output-dir /path/to/inferred/configs

# - use all tables in schema "YOUR_DB.SCHEMA"
# - use scores 1,3,5,7 for confidences LOW,MEDIUM,HIGH,VERY_HIGH
# - use solver threshold of 0.65
# - write output files in directory "/path/to/inferred/configs"
# - overwrite pre-existing files in output directory without failing
mf infer --schema YOUR_DB.SCHEMA --solver-weights 1,3,5,7 --solver-threshold 0.65 --output-dir /path/to/inferred/configs --overwrite

If you need help with CLI usage, run mf infer --help.

From a custom script

For most use cases, the CLI should be good enough as it provides reasonable defaults and a convenient way to get started with inference. However, you might still want fine-grained control over specific steps in your inference pipeline.

By running data source inference through Python scripts, you can:

  • Instantiate the provided inference rules with granular control over the tuning parameters of each rule and the solver;
  • Implement your own complex inference rules, tailored to your warehouse's specific structure;
  • Tune the inference solver's weight attribution function;
  • Configure custom rendering/upload pipelines, e.g., uploading configuration files to cloud storage services such as S3 or Google Cloud Storage.

The following is an example inference pipeline that uses the default ruleset and solver, and then renders results to both the standard output and configuration files.

from metricflow.configuration.config_handler import ConfigHandler
from metricflow.dataflow.sql_table import SqlTable
from metricflow.sql_clients.sql_utils import make_sql_client_from_config

from metricflow.cli.main import CLIInferenceProgressReporter
from metricflow.inference.context.snowflake import SnowflakeInferenceContextProvider
from metricflow.inference.rule.defaults import DEFAULT_RULESET
from metricflow.inference.solver.weighted_tree import WeightedTypeTreeInferenceSolver
from metricflow.inference.renderer.stream import StreamInferenceRenderer
from metricflow.inference.renderer.config_file import ConfigFileRenderer
from metricflow.inference.runner import InferenceRunner

provider = SnowflakeInferenceContextProvider(
    client=make_sql_client_from_config(ConfigHandler()),
    tables=[
        SqlTable.from_string("YOUR_DB.SCHEMA.YOUR_TABLE"),
        SqlTable.from_string("YOUR_DB.SCHEMA.YOUR_OTHER_TABLE"),
    ],
    # sample at most 10000 rows to manage costs
    max_sample_size=10000,
)

runner = InferenceRunner(
    context_providers=[provider],
    ruleset=DEFAULT_RULESET,
    solver=WeightedTypeTreeInferenceSolver(),
    renderers=[
        # write inference results in a human-readable format to the console
        StreamInferenceRenderer.stdout(),
        # write inference results to config files at the specified path
        ConfigFileRenderer("/path/to/inferred/configs", overwrite=True),
    ],
    # report progress the same way the CLI does
    # (easy way of getting pretty prints for each inference step)
    progress_reporter=CLIInferenceProgressReporter(),
)

# run the inference pipeline
runner.run()

Check out how to implement custom inference rules if you need to.