Data quality is always a hot topic: it's a top priority for most data organizations, yet it remains a persistent concern. Ironically, there are many tools that address data quality, but it's not always easy to understand which solution is right for your organization.
In this blog post we'll explore Snowflake's native data quality solution, called Data Quality Monitoring. We'll review how it works and provide a step-by-step guide to creating your first Data Metric Functions (DMFs), the tests that power this product. We think that becoming familiar with Data Quality Monitoring and DMFs is critical for any organization running on Snowflake, and we'll also weigh its pros and cons against third-party solutions.
Let’s get started!
Earlier this year, Snowflake introduced Data Quality Monitoring and Data Metric Functions (DMFs), a powerful framework for monitoring data quality natively within Snowflake. DMFs let you define custom rules, or use Snowflake's pre-defined ones (called System DMFs), to validate data in Snowflake tables and columns. These rules are evaluated automatically, helping you maintain high data quality across your Snowflake environment.
Data Quality Monitoring and DMFs recently became officially supported as part of the Enterprise Edition, after several months in Public Preview.
Snowflake's data quality solution provides several key benefits over third-party tools: it runs natively inside Snowflake, it can evaluate data in real time as it changes, and it avoids the cost and integration overhead of an external vendor. Of course, there are also reasons to use third-party vendors for data quality, such as coverage for data platforms beyond Snowflake.
If your organization is using Snowflake extensively, Data Quality Monitoring and DMFs can offer a seamless, cost-effective, and, most importantly, real-time solution for data quality.
In the following section we’ll walk you through setting up your first DMFs to upgrade your Snowflake data quality.
Start by creating `Customers` and `Orders` tables and filling them with sample data. Run the following in your Snowflake console:
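Here's a minimal version of such a setup; the exact schema and sample values are illustrative assumptions:

```sql
-- Illustrative schema; column names and sample values are assumptions
CREATE OR REPLACE TABLE customers (
  customer_id NUMBER,
  email       VARCHAR,
  country     VARCHAR
);

CREATE OR REPLACE TABLE orders (
  order_id     NUMBER,
  customer_id  NUMBER,
  order_status VARCHAR
);

INSERT INTO customers (customer_id, email, country) VALUES
  (1, 'jane.doe@example.com', 'US'),
  (2, 'john.smith@example.com', 'DE');

INSERT INTO orders (order_id, customer_id, order_status) VALUES
  (100, 1, 'pending'),
  (101, 2, 'completed');
```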
Next, create DMFs to validate specific fields. Let's consider the following examples:

- The `order_status` field is an enum that must be one of the following values: `pending`, `in progress`, `completed`.
- The `country` field must be a 2-character uppercase string.
- The `email` field must contain a valid email format, i.e., `<address>@<domain>`.
Here’s the SQL code for these validation rules:
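A sketch of the three DMFs, assuming the sample schema above (the names and regex patterns are our own; a DMF receives a table argument and must return a NUMBER):

```sql
-- Counts rows whose order_status falls outside the allowed enum values
CREATE OR REPLACE DATA METRIC FUNCTION invalid_order_status_count(t TABLE(order_status VARCHAR))
RETURNS NUMBER
AS
$$
  SELECT COUNT(*) FROM t
  WHERE order_status NOT IN ('pending', 'in progress', 'completed')
$$;

-- Counts rows whose country is not a 2-character uppercase string
-- (REGEXP_LIKE matches against the entire string)
CREATE OR REPLACE DATA METRIC FUNCTION invalid_country_count(t TABLE(country VARCHAR))
RETURNS NUMBER
AS
$$
  SELECT COUNT(*) FROM t
  WHERE NOT REGEXP_LIKE(country, '[A-Z]{2}')
$$;

-- Counts rows whose email does not match a basic <address>@<domain> shape
CREATE OR REPLACE DATA METRIC FUNCTION invalid_email_count(t TABLE(email VARCHAR))
RETURNS NUMBER
AS
$$
  SELECT COUNT(*) FROM t
  WHERE NOT REGEXP_LIKE(email, '[^@]+@[^@]+')
$$;
```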
Let's schedule our validations to run on every change made to the table. The `TRIGGER_ON_CHANGES` schedule runs the DMF whenever rows in the table are changed, specifically when rows are inserted, deleted, or updated (via `INSERT`, `UPDATE`, or any other DML operation).
This part is important: this scheduling option is unique to DMFs and is not available in most observability or monitoring tools, which run on a predetermined cadence. It allows Snowflake DMFs to detect problems immediately as they occur, without waiting for the next scheduled run, which can be hours later. Increasing the cadence is usually not viable with a third-party tool, since frequent runs get expensive. We therefore believe this option is a game changer for how quickly data issues can be detected, reducing the overall impact of data incidents.
Now that we have DMFs defined and have set the schedule for DMF validations on the relevant tables, only one piece is missing: attaching the DMFs to specific tables and columns.
DMFs are standalone Snowflake objects that are not tied to a specific table or column. This approach is quite elegant, as it allows writing a DMF once (e.g., a DMF for validating emails) and reusing it across many different tables and columns.
Here's example code to attach our DMFs to the specific columns in question:
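Continuing with the sample tables and DMF names assumed above:

```sql
-- Attach each DMF to the column it should validate
ALTER TABLE orders ADD DATA METRIC FUNCTION invalid_order_status_count ON (order_status);
ALTER TABLE customers ADD DATA METRIC FUNCTION invalid_country_count ON (country);
ALTER TABLE customers ADD DATA METRIC FUNCTION invalid_email_count ON (email);
```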
It's time to test that our new DMFs are working properly. To do so, we'll insert some offending rows into our tables and see if we indeed get validation errors reported by our DMFs.
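For example, one order with an unknown status, plus a customer with a malformed email and country (values are illustrative):

```sql
-- One row violating each rule
INSERT INTO orders (order_id, customer_id, order_status) VALUES (102, 1, 'shipped');
INSERT INTO customers (customer_id, email, country) VALUES (3, 'not-an-email', 'usa');
```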
Snowflake stores all DMF results in Snowflake tables in a standard format, which makes them easy to read and query. Given this standardization, we'd like to believe this can become the standard way of reporting validation results in Snowflake, making interoperability across different products, such as data management platforms and catalogs, a lot easier.
Let's verify that we indeed see the DMF validation errors. This query shows the latest validation result for the `INVALID_ORDER_STATUS_COUNT` DMF:
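A sketch of such a query, using the `SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS` view where Snowflake records DMF measurements (the column selection is our assumption):

```sql
SELECT measurement_time, table_name, metric_name, value
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE metric_name = 'INVALID_ORDER_STATUS_COUNT'
ORDER BY measurement_time DESC
LIMIT 1;
```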
Please note that it may take about 5 minutes for results to arrive. Afterwards, you should expect to see output similar to this:
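Illustrative output only; the timestamp and exact columns will differ in your account:

```
MEASUREMENT_TIME               TABLE_NAME  METRIC_NAME                 VALUE
2024-XX-XX 12:34:56.000 +0000  ORDERS      INVALID_ORDER_STATUS_COUNT  1
```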
You can see here that the number of rows that failed the `INVALID_ORDER_STATUS_COUNT` validation is exactly 1, which is the single offending row we inserted in the previous step.
Good data quality solutions provide notifications for broadcasting their results, and Snowflake's Data Quality Monitoring does this too. For example, we can set up an email notification that is sent on a recurring schedule whenever rows have failed the `INVALID_ORDER_STATUS_COUNT` validation.
First, make sure you have an email integration configured in your Snowflake account by running the following:
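A minimal sketch; the integration name `my_email_int` is our choice, and the recipient must be a verified email address of a user in your account:

```sql
CREATE OR REPLACE NOTIFICATION INTEGRATION my_email_int
  TYPE = EMAIL
  ENABLED = TRUE
  ALLOWED_RECIPIENTS = ('you@example.com');  -- replace with your real address
```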
Then, define a stored procedure that sends the email if the `INVALID_ORDER_STATUS_COUNT` DMF has picked up bad rows. Make sure you use your real email address in this code so that the email is delivered successfully:
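One possible implementation as a Snowflake Scripting procedure; the procedure name is our own, and `SYSTEM$SEND_EMAIL` relies on the integration created above:

```sql
CREATE OR REPLACE PROCEDURE notify_on_invalid_order_status()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
  failed_count NUMBER DEFAULT 0;
BEGIN
  -- Read the most recent measurement recorded for the DMF
  -- (COALESCE/MAX keeps this safe when no measurements exist yet)
  SELECT COALESCE(MAX(value), 0) INTO :failed_count
  FROM (
    SELECT value
    FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
    WHERE metric_name = 'INVALID_ORDER_STATUS_COUNT'
    ORDER BY measurement_time DESC
    LIMIT 1
  );

  IF (failed_count > 0) THEN
    CALL SYSTEM$SEND_EMAIL(
      'my_email_int',
      'you@example.com',  -- replace with your real, verified address
      'DMF alert: invalid order statuses detected',
      'INVALID_ORDER_STATUS_COUNT reported ' || failed_count || ' failing row(s).');
  END IF;
  RETURN 'checked: ' || failed_count || ' failing row(s)';
END;
$$;
```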
Then, we create a recurring task that runs this procedure on an hourly schedule:
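A sketch, assuming a warehouse named `my_wh` exists in your account:

```sql
CREATE OR REPLACE TASK notify_on_invalid_order_status_task
  WAREHOUSE = my_wh
  SCHEDULE = '60 MINUTE'
AS
  CALL notify_on_invalid_order_status();

-- Tasks are created suspended; resume so the schedule takes effect
ALTER TASK notify_on_invalid_order_status_task RESUME;
```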
For the sake of the test, we can also run it immediately, without waiting for the next scheduled run, by calling the procedure directly:
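Using the procedure name from our sketch:

```sql
CALL notify_on_invalid_order_status();
```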
At this point you can check that you’ve received the notification email. It may take a few minutes for the email to get to your inbox - once it’s arrived, the test is officially done!
Now that we are done with the test, let’s clean up all of the objects we created exclusively for this tutorial:
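Assuming the names used throughout this walkthrough:

```sql
-- Detach DMFs from their columns first
ALTER TABLE orders DROP DATA METRIC FUNCTION invalid_order_status_count ON (order_status);
ALTER TABLE customers DROP DATA METRIC FUNCTION invalid_country_count ON (country);
ALTER TABLE customers DROP DATA METRIC FUNCTION invalid_email_count ON (email);

-- Then drop the objects themselves
DROP TASK IF EXISTS notify_on_invalid_order_status_task;
DROP PROCEDURE IF EXISTS notify_on_invalid_order_status();
DROP DATA METRIC FUNCTION IF EXISTS invalid_order_status_count(TABLE(VARCHAR));
DROP DATA METRIC FUNCTION IF EXISTS invalid_country_count(TABLE(VARCHAR));
DROP DATA METRIC FUNCTION IF EXISTS invalid_email_count(TABLE(VARCHAR));
DROP INTEGRATION IF EXISTS my_email_int;
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
```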
In this article we learned how to configure DMFs to validate data in Snowflake, attach them to tables on a schedule (including running on every change), view validation results, and set up email notifications to alert on validation errors. With this knowledge, you can now define DMFs for all the important tables in your Snowflake warehouse to make sure data invariants and data contracts are validated and enforced.
However, this process can be quite tedious, especially when you have hundreds or thousands of tables, each with dozens of columns and a large number of data validation rules.
Fortunately, much of this work can be automated. For example, we can infer most of the rules automatically by looking at the SQL code executed in the warehouse and making assertions based on the actual queries we observe: if a specific column is always tested against a small set of values, we can assume the column is an enum and create an appropriate DMF to validate it, as in the sketch below.
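For instance, repeatedly observing a hypothetical filter like this one in query history:

```sql
-- Observed query: order_status is always compared against a fixed set of values,
-- suggesting it's an enum worth validating with a DMF like invalid_order_status_count
SELECT * FROM orders WHERE order_status IN ('pending', 'in progress', 'completed');
```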
To make this easier, at Foundational we created a free DMF Generator, which automatically generates DMF rules based solely on a sample SQL query. We also invite you to schedule time with us to see how Foundational can automate and keep DMFs up to date for you, so that you never have to worry about unvalidated Snowflake data.