Monitoring data quality at scale using Monte Carlo

We care a great deal about data quality. As data engineers at Vimeo, we take pains not only to make data accessible to everyone but also to guarantee data quality over time. We ingest and process data into our Snowflake data warehouse from a variety of data sources, which vary by format, frequency, origin, and more. With so much going on, it’s challenging to ensure data quality without a holistic view of our data warehouse.

That’s why we’ve turned to Monte Carlo as our data observability tool. If you want to know why and how we scaled its usage, keep reading this post.

You can think of Monte Carlo as a New Relic or Datadog for data engineers: a one-stop shop for observability across your data ecosystem. Monte Carlo offers a lineage feature, which lets you visualize how your data flows, from data sources down to data models and eventually to reports (see Figure 1).

Figure 1. Data lineage in Monte Carlo, from data sources to data models to reports.

Incident management integrates natively with Slack, so you can be notified and run the whole conversation, including incident status updates, in one place. This makes it easy to keep your colleagues and stakeholders up to date on the status and impact of an incident (see Figure 2).

Figure 2. Managing an incident through Monte Carlo’s Slack integration.

And of course — monitoring.

Monte Carlo monitors are anomaly detection monitors based on ML, or machine learning. They learn the patterns in your data and alert you whenever your data is behaving abnormally.

Monte Carlo monitors can be categorized into two types: automatic monitors and custom monitors.

Automatic monitors

Automatic monitors are set up out of the box for every data set in your data warehouse, so you don’t need to worry about creating them whenever you introduce a new data set. These monitors, which cover freshness, volume, and schema, might sound basic, but they do a great job of catching staleness in your data sets, suspicious drops or spikes in data volume, and schema changes.

The main reason these monitors are created automatically is that they consume relatively little compute, since they rely on metadata rather than the actual data. Monte Carlo continuously collects metadata from Snowflake’s information schema: it uses columns such as LAST_ALTERED, ROW_COUNT, and BYTES from information_schema.tables to monitor freshness and size, and it uses GET_DDL() to monitor schema changes. Check out the Monte Carlo blog for a sneak peek at how Monte Carlo leverages metadata for the most basic monitors.

Recently, a size anomaly monitor caught an outage affecting a bunch of backend Big Picture events at Vimeo (see Figure 3). One of my colleagues might have more to say about what Big Picture is in an upcoming post for the Vimeo Engineering Blog.

Figure 3. A size anomaly monitor catching an outage in backend Big Picture events.

Custom monitors

The other type is custom monitors. These are created manually, since they’re potentially more compute-intensive; later I’ll explain how to scale this option using monitors as code. They’re also more sophisticated than the basic automatic monitors.

There are four types of custom monitors: field health, dimension tracking, JSON schema, and SQL rules. I’ll elaborate on the two types our team uses the most:

  • Field health monitors. A field health monitor collects metrics on each column of your data set and alerts when any of them is anomalous. These metrics include the percentage of null values, the percentage of unique values, and, for quantitative columns, summary statistics such as minimum, maximum, average, standard deviation, and percentiles.
  • Dimension tracking monitors. A dimension tracking monitor is best suited for low-cardinality fields. It alerts you if the distribution of values changes significantly.

We have found custom monitors useful for catching incidents and bugs that we can’t anticipate. Unlike tests in Great Expectations or dbt, where you know in advance what you expect or what you want to test over time, these monitors do a great job of catching everything you don’t know you need to test. For example, an anomalous null percentage on a certain column can be a proxy for a deeper issue that is harder to anticipate and test explicitly.

These monitors can be created relatively easily via a wizard in the Monte Carlo UI (see Figure 4).

Figure 4. The wizard for creating a custom monitor in the Monte Carlo UI.

One example of where this is useful is setting a dimension tracking monitor on the platform field of a data set that collects signups in a cross-platform product. Suppose that in your SaaS product, 60 percent of users usually onboard from the web, 30 percent from iOS, and 10 percent from Android.

Now say that this ratio changes to 65 percent iOS, 25 percent web, and 10 percent Android (see Figure 5):

Figure 5. A shift in the distribution of signups by platform.

In this example, iOS signups have increased at the expense of web signups. This is suspicious: it can imply that web data collection is broken, or that something in the web product itself is broken and fewer leads are converting into users.

As a team, we wanted to leverage Monte Carlo’s monitors as code functionality. This feature enables the easy creation of monitors based on configurations stored in a YML file. Creating a Monte Carlo monitor is as simple as creating or updating a YML file with the monitor’s config.
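For example, a standalone monitors-as-code file might look roughly like the sketch below. The keys follow Monte Carlo’s monitors-as-code format, and the schema, table, and field names are made up for illustration:

```yaml
# monitors/signups.yml -- illustrative sketch; the schema, table, and field
# names are placeholders rather than real Vimeo data sets.
montecarlo:
  field_health:
    # Track per-column metrics (null rate, unique rate, summary statistics)
    # and alert on anomalies.
    - table: analytics:prod.signups
      timestamp_field: created_at
  dimension_tracking:
    # Alert when the value distribution of a low-cardinality field shifts,
    # like the platform split in the example above.
    - table: analytics:prod.signups
      timestamp_field: created_at
      field: platform
```

In our setup the namespace comes from the command line (the --namespace flag in the commands below) rather than from each file.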

This unleashes the power of their ML-based monitors in a scalable, reproducible manner. Internally our team discussed two complementary approaches for this:

  • Integrate into dbt, taking advantage of how it fits well with a dbt schema.yml file. For more information, see the Monte Carlo docs.
  • Create a standalone repo for creating Monte Carlo monitors, decoupled from dbt and for any use case.

We decided to start with the second option, the standalone repo. In general, monitors created on dbt models fit the first option, and everything else fits the second, including raw data ingested into Snowflake and data models that are transformed without dbt. The main reason we started with the second option is that, at the time, we were still shaping the way we wanted to implement dbt, and it seemed simpler to start with a standalone tool and integrate it into dbt later, carrying over the lessons learned. You can think of this repo as a centralized “Terraform for monitors”: in the same way that cloud resources are managed and deployed as Terraform code (that is, infrastructure as code), this repo enables anyone, for any use case, to create, get a review for, and deploy a Monte Carlo monitor.

We have defined a CI/CD process with Jenkins.

For the more graphically inclined, the flowchart for this process, which I’ll describe below, appears in Figure 6.

Figure 6. The CI/CD flow for monitors as code.

Here’s how the process works.

The engineer or analyst who would like to create a Monte Carlo monitor is responsible for creating their monitor’s configuration in a YML file and testing it locally using `montecarlo apply --namespace $MC_MONITORS_NAMESPACE --dry-run`.

This does the following:

  • Fails if the YML file is misconfigured (bad indentation and so on).
  • Prints out what the new configuration is going to deploy to the Monte Carlo platform. You can think of this as an equivalent of the `terraform plan` command.

See Figure 7 for an output example.

Figure 7. Example output of a local dry run.

Once the config is tested locally, open a pull request, or PR. This triggers a few steps in Jenkins (a rough sketch of a comparable pipeline follows the list):

  • Build a Docker image.
  • Install requirements such as `montecarlodata`.
  • Run `montecarlo apply --namespace $MC_MONITORS_NAMESPACE --dry-run`.
  • Add a comment to the PR with the dry-run output, making it easy for the reviewer to understand what the PR is going to deploy to the Monte Carlo platform.
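Our pipeline itself lives in a Jenkinsfile, but the steps above translate to any CI system. Purely as an illustration, here is roughly how they could look in a YAML-based workflow (GitHub Actions syntax); the secret names, the namespace value, and the use of a plain Python setup instead of our Docker image are all assumptions for this sketch:

```yaml
# Illustrative only -- our real pipeline is a Jenkinsfile.
name: Monte Carlo dry run
on: pull_request

jobs:
  dry-run:
    runs-on: ubuntu-latest
    env:
      # Assumed credential variables for the montecarlodata CLI; set these up
      # however you authenticate the CLI in your own CI.
      MCD_DEFAULT_API_ID: ${{ secrets.MCD_API_ID }}
      MCD_DEFAULT_API_TOKEN: ${{ secrets.MCD_API_TOKEN }}
      MC_MONITORS_NAMESPACE: my-monitors-namespace  # placeholder
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install requirements
        run: pip install montecarlodata
      - name: Dry run
        # Fails the build if any monitor YML file is misconfigured.
        shell: bash
        run: montecarlo apply --namespace "$MC_MONITORS_NAMESPACE" --dry-run | tee dry_run.txt
      - name: Comment the dry-run output on the PR
        env:
          GH_TOKEN: ${{ github.token }}
        run: gh pr comment "${{ github.event.pull_request.number }}" --body-file dry_run.txt
```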

See Figure 8 for a typical PR comment added by Jenkins.

Figure 8. A typical PR comment added by Jenkins with the dry-run output.

Finally, once the PR is approved and merged, Jenkins deploys the changes into Monte Carlo by running `montecarlo apply --namespace $MC_MONITORS_NAMESPACE`.

We have recently implemented dbt Cloud at Vimeo. Monte Carlo monitors can easily be coupled with dbt by adding monitor configurations under the montecarlo key in the dbt schema.yml file. This enables us to suggest that engineers and analysts create Monte Carlo monitors as part of the same PR that creates their dbt model. In addition, it makes a lot of sense to have all of the information about a dbt model in one place, including documentation, tests, and Monte Carlo monitors.
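As a rough sketch, that can look something like the following. The model, table, and field names are made up, and the exact placement of the montecarlo block (at the top level of schema.yml versus under a model’s meta key) is worth checking against the current Monte Carlo docs:

```yaml
# models/signups/schema.yml -- illustrative sketch only; names are placeholders.
version: 2

models:
  - name: signups
    description: One row per signup, with the platform it came from.
    columns:
      - name: platform
        description: web, ios, or android

# Monitors-as-code block, using the same keys as the standalone repo example
# earlier in this post.
montecarlo:
  dimension_tracking:
    - table: analytics:prod.signups
      timestamp_field: created_at
      field: platform
```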

We also tend to route Monte Carlo notifications to specific Slack channels. This way, the right team gets notified, with the right context; otherwise, these alert notifications are just white noise. Another possible enhancement is to add a check to Jenkins that verifies whether each monitor is routed to a specific channel or falls through to the default channel that holds them all, which in our case is #montecarlo-monitors.

I hope this post has inspired you to take the next step in monitoring data quality at scale. Achieving a culture of data quality takes more than a single engineer, and I would say more than a single team. As data engineers, we want to build tools and frameworks that make it easy for anyone to contribute to data quality, and this is one step in that direction.
