Boosting Data Quality with Snowflake Data Quality Monitoring

August 21, 2024
Barak Fargoun

Introduction

Data quality is always a hot topic: it’s a top priority, yet it remains a key concern for the vast majority of data organizations. Ironically, while there are many tools that address data quality, it’s not always easy to understand which solution is right for your organization.

In this blog post we’ll explore Snowflake’s native data quality solution, called Data Quality Monitoring. We’ll review how it works and provide a step-by-step guide to creating your first Data Metric Functions (DMFs), the tests this product runs. We think familiarity with Data Quality Monitoring and DMFs is critical knowledge for any organization running on Snowflake, and we’ll also explore its pros and cons against third-party solutions.

Let’s get started!

What is Snowflake Data Quality Monitoring?

Earlier this year, Snowflake introduced Data Quality Monitoring and Data Metric Functions (DMFs) as a powerful framework for data quality monitoring natively within Snowflake. DMFs allow you to define custom rules, or use Snowflake’s pre-built ones (called system DMFs), to validate data in Snowflake tables and columns. These rules are evaluated automatically, helping you maintain high data quality across your Snowflake environment.

Data Quality Monitoring and DMFs are now officially supported as part of the Enterprise Edition, after several months in public preview.
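
For example, the system DMFs live in the SNOWFLAKE.CORE schema (NULL_COUNT, DUPLICATE_COUNT, UNIQUE_COUNT, ROW_COUNT, FRESHNESS, and others) and can be invoked ad hoc against any table. A minimal sketch, assuming a customers table with an email column like the one we create below:

// Count NULL emails on demand, without any scheduling
SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT email FROM customers);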

Why Use Snowflake’s Native Data Quality Solution?

Snowflake’s data quality solution provides several key benefits over third-party tools:

  • Seamless Integration: DMFs are built directly into Snowflake, making them easy to use without needing additional setup or infrastructure.
  • Cost-Effective: Since DMFs are part of Snowflake, there are no extra licensing fees, and the per-test cost (e.g., a query that checks a table) is just warehouse compute, whereas a third-party tool charges a premium on top.
  • Real-Time Validation: One of the biggest advantages of DMFs is that they can be triggered as soon as data changes occur, catching issues immediately. This is not the case for third-party vendors, which typically run tests at a fixed cadence.
  • Security: Data quality vendors typically need access to raw data, which is not great from a security perspective. Using DMFs minimizes third-party access to sensitive data such as PII and simplifies governance.

Of course, there are also advantages to using third-party vendors for data quality:

  • Advanced Features: While DMFs are powerful, they may lack some advanced features like machine learning-based anomaly detection found in third-party tools.
  • Customization Limits: Some third-party tools might offer more flexibility for complex data quality workflows.

If your organization is using Snowflake extensively, Data Quality Monitoring and DMFs could offer a seamless, cost-effective, and most importantly, real-time solution for data quality.

Step-by-Step Guide to Implementing Snowflake Data Quality with DMFs

In the following section we’ll walk you through setting up your first DMFs to upgrade your Snowflake data quality.

Step 1: Create Tables with Sample Data

Start by creating Customers and Orders tables and filling them with sample data. Run the following in your Snowflake console:

CREATE DATABASE IF NOT EXISTS DMF_TESTING;
CREATE SCHEMA IF NOT EXISTS TEMP;

USE DATABASE DMF_TESTING;
USE SCHEMA TEMP;

CREATE OR REPLACE TABLE customers (    
	id INTEGER AUTOINCREMENT PRIMARY KEY,    
	first_name STRING,    
	last_name STRING,    
	email STRING,    
	address STRING,    
	state STRING,    
	country STRING,    
	created_date TIMESTAMP
);

INSERT INTO customers (first_name, last_name, email, address, state, country, created_date)
VALUES    
	('John', 'Doe', 'john.doe@example.com', '123 Elm St', 'CA', 'USA', CURRENT_TIMESTAMP),    
	('Jane', 'Smith', 'jane.smith@example.com', '456 Oak St', 'NY', 'USA', CURRENT_TIMESTAMP),    
	('Robert', 'Brown', 'robert.brown@example.com', '789 Pine St', 'TX', 'USA', CURRENT_TIMESTAMP),    
	('Emily', 'Davis', 'emily.davis@example.com', '101 Maple St', 'FL', 'USA', CURRENT_TIMESTAMP),    
	('Michael', 'Johnson', 'michael.johnson@example.com', '202 Cedar St', 'IL', 'USA', CURRENT_TIMESTAMP),    
	('Sarah', 'Williams', 'sarah.williams@example.com', '303 Birch St', 'WA', 'USA', CURRENT_TIMESTAMP),    
	('David', 'Miller', 'david.miller@example.com', '404 Walnut St', 'NV', 'USA', CURRENT_TIMESTAMP),    
	('Laura', 'Wilson', 'laura.wilson@example.com', '505 Ash St', 'CO', 'USA', CURRENT_TIMESTAMP),    
	('James', 'Moore', 'james.moore@example.com', '606 Spruce St', 'OR', 'USA', CURRENT_TIMESTAMP),    
	('Anna', 'Taylor', 'anna.taylor@example.com', '707 Fir St', 'AZ', 'USA', CURRENT_TIMESTAMP);  

CREATE OR REPLACE TABLE orders (    
	order_id INTEGER AUTOINCREMENT PRIMARY KEY,    
	customer_id INTEGER,    
	order_status STRING,    
	total_value FLOAT,    
	FOREIGN KEY (customer_id) REFERENCES customers(id)
);

INSERT INTO orders (customer_id, order_status, total_value)
VALUES    
	(1, 'pending', 150.00),    
	(2, 'in progress', 200.50),    
	(3, 'completed', 300.75),    
	(4, 'pending', 450.00),    
	(5, 'in progress', 100.25),    
	(6, 'completed', 250.00),    
	(7, 'pending', 75.00),    
	(8, 'in progress', 500.00),    
	(9, 'completed', 325.50),    
	(10, 'pending', 400.00);

Step 2: Create DMFs for Data Validation

Next, create DMFs to validate specific fields. Let’s consider the following examples:

  1. The order_status field is an enum that must be one of the following values: pending, in progress, completed.
  2. The country field must be a 2-character uppercase string.
  3. The email field must follow a valid email format, i.e., <address>@<domain>.

Here’s the SQL code for these validation rules:

// order_status is an enum that is always set to one of these values: pending, in progress, completed
CREATE DATA METRIC FUNCTION IF NOT EXISTS  
  INVALID_ORDER_STATUS_COUNT (ARG_T table(ARG_C1 STRING))  
  RETURNS NUMBER AS  
  'SELECT COUNT_IF(ARG_C1 NOT IN (''pending'', ''in progress'', ''completed''))    
  	FROM ARG_T';     
  
// country field is always a 2-character uppercase string 
CREATE DATA METRIC FUNCTION IF NOT EXISTS  
  INVALID_COUNTRY_FORMAT_COUNT (ARG_T table(ARG_C1 STRING))  
  RETURNS NUMBER AS  
  'SELECT COUNT_IF(LENGTH(ARG_C1) != 2 OR UPPER(ARG_C1) != ARG_C1)    
  	FROM ARG_T';  
  
// email field contains a valid email format 
CREATE DATA METRIC FUNCTION IF NOT EXISTS  
  INVALID_EMAIL_COUNT (ARG_T table(ARG_C1 STRING))  
  RETURNS NUMBER AS  
  'SELECT COUNT_IF(ARG_C1 NOT LIKE ''%_@__%.__%'')    
  	FROM ARG_T';
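
Before attaching these DMFs to tables, you can sanity-check them by calling them ad hoc, using the same invocation syntax as system DMFs. A quick sketch against the sample tables from Step 1 (both should return 0 on the clean data):

// Ad hoc invocation of our custom DMFs
SELECT INVALID_ORDER_STATUS_COUNT(SELECT order_status FROM orders);
SELECT INVALID_EMAIL_COUNT(SELECT email FROM customers);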
	

Step 3: Schedule DMFs for Real-Time Monitoring

Let’s schedule our validations to run on every change made to the table. The TRIGGER_ON_CHANGES schedule runs the DMF whenever rows in the table change, specifically when rows are inserted, updated, or deleted (via INSERT, UPDATE, or any other DML operation).

This part is important: this scheduling option is unique to DMFs and is not available in most observability and monitoring tools, which run on a predetermined cadence. It allows Snowflake DMFs to detect problems immediately as they occur, without waiting for the next scheduled run, which can be hours later. Increasing the cadence is usually not viable, since frequent runs get expensive when executed by a third party. We therefore strongly believe this option is a game changer for how quickly data issues can be detected, reducing the overall impact of data incidents for customers.

ALTER TABLE customers SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';
ALTER TABLE orders SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';
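
Note that TRIGGER_ON_CHANGES is not the only option: DATA_METRIC_SCHEDULE also accepts interval- and cron-based schedules, if a fixed cadence is all you need for a given table. A sketch of both variants (not part of this tutorial):

// Run the attached DMFs every 60 minutes
ALTER TABLE orders SET DATA_METRIC_SCHEDULE = '60 MINUTE';

// Or on a cron schedule, e.g., daily at 8am UTC
ALTER TABLE orders SET DATA_METRIC_SCHEDULE = 'USING CRON 0 8 * * * UTC';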

Step 4: Attach DMFs to Specific Columns

Now that we have DMFs defined and a schedule set for DMF validations on the relevant tables, only one final piece is missing: attaching each DMF to specific tables and columns.

DMFs are standalone Snowflake objects that are not tied to a specific table or column. This approach is quite elegant, as it allows you to write a DMF once (e.g., a DMF for validating emails) and reuse it across many different tables and columns.

Here’s example code to attach our DMFs to the specific columns in question:

ALTER TABLE orders ADD DATA METRIC FUNCTION  
  INVALID_ORDER_STATUS_COUNT ON (order_status);
  
ALTER TABLE customers ADD DATA METRIC FUNCTION
  INVALID_COUNTRY_FORMAT_COUNT ON (country);
  
ALTER TABLE customers ADD DATA METRIC FUNCTION  
  INVALID_EMAIL_COUNT ON (email);
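
To confirm the associations took effect, you can list the DMFs attached to a table with the INFORMATION_SCHEMA.DATA_METRIC_FUNCTION_REFERENCES table function. A quick sketch:

// List all DMFs attached to the orders table
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DATA_METRIC_FUNCTION_REFERENCES(
  REF_ENTITY_NAME => 'DMF_TESTING.TEMP.ORDERS',
  REF_ENTITY_DOMAIN => 'TABLE'
));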

Step 5: Test the DMFs

It’s time to verify that our new DMFs are working properly. To do so, we’ll insert some offending rows into our tables and check that the DMFs indeed report validation errors.

// Insert a row into the orders table with an invalid order status (e.g., a status that is not one of the allowed values)
INSERT INTO orders (customer_id, order_status, total_value)
VALUES (1, 'invalid_status', 99.99);
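
We can do the same for the customers table. The following row violates both the email and the country rules at once:

// Insert a customer with a malformed email and a lowercase, 3-character country
INSERT INTO customers (first_name, last_name, email, address, state, country, created_date)
VALUES ('Bad', 'Row', 'not-an-email', '1 Nowhere St', 'CA', 'usa', CURRENT_TIMESTAMP);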

Step 6: View DMF Validation Results

Snowflake stores all DMF results in Snowflake tables in a standard format, which makes those results easy to read and query. Given this standardization, we’d like to believe it can become the standard way of reporting validation results in Snowflake, making interoperability across products such as data management platforms and catalogs a lot easier.

Let’s verify that we indeed see the DMF validation errors. This query will show the latest validation result for the INVALID_ORDER_STATUS_COUNT DMF:

SELECT measurement_time, table_name, metric_name, value
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE METRIC_NAME = 'INVALID_ORDER_STATUS_COUNT'
AND METRIC_DATABASE = 'DMF_TESTING'
ORDER BY MEASUREMENT_TIME DESC
LIMIT 1;

Please note that it may take about five minutes for results to arrive. Afterwards, you should see that the number of rows failing the INVALID_ORDER_STATUS_COUNT validation is exactly 1: the single row we added in the previous step.
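
The same view can also be queried across all of our DMFs at once; for example, to see the latest value per metric (a sketch using Snowflake’s MAX_BY aggregate):

SELECT table_name, metric_name, MAX_BY(value, measurement_time) AS latest_value
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE metric_database = 'DMF_TESTING'
GROUP BY table_name, metric_name;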

Step 7: Set Up Email Notifications for Validation Errors

Good data quality solutions provide notifications to broadcast their results, and Snowflake’s Data Quality Monitoring does this too. For example, we can set up an email notification that is sent once per day if any rows failed the INVALID_ORDER_STATUS_COUNT validation.
First, make sure email integration is configured in your Snowflake account by running the following:

CREATE NOTIFICATION INTEGRATION my_email_int  
 TYPE=EMAIL  
 ENABLED=TRUE;
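
Note that SYSTEM$SEND_EMAIL can only deliver to email addresses that are verified for users in your Snowflake account. You can also optionally restrict the integration to specific recipients; a variant of the above would look like this:

CREATE OR REPLACE NOTIFICATION INTEGRATION my_email_int
 TYPE=EMAIL
 ENABLED=TRUE
 ALLOWED_RECIPIENTS=('your_email@domain.com');  -- Replace with your verified email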

Then, define a stored procedure that sends the email if the INVALID_ORDER_STATUS_COUNT DMF has picked up bad rows. Make sure to use your real email address in this code so that the email is delivered successfully:

CREATE OR REPLACE PROCEDURE CHECK_INVALID_ORDER_STATUS_AND_NOTIFY()  
 RETURNS STRING  
 LANGUAGE SQL  
 EXECUTE AS CALLER
AS
$$
DECLARE
 INVALID_ORDER_STATUS_COUNT_RESULT NUMBER := 0;
BEGIN  
 -- Query the results for the INVALID_ORDER_STATUS_COUNT DMF
 -- (aggregating so that SELECT ... INTO always returns exactly one row,
 -- even when there are no failing measurements)
 SELECT COALESCE(MAX(VALUE), 0)
 INTO INVALID_ORDER_STATUS_COUNT_RESULT
 FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
 WHERE METRIC_NAME = 'INVALID_ORDER_STATUS_COUNT'
  AND TABLE_NAME = 'ORDERS'
  AND VALUE > 0;
 
 IF (:INVALID_ORDER_STATUS_COUNT_RESULT > 0) THEN
  CALL SYSTEM$SEND_EMAIL(      
    'my_email_int',      
    'your_email@domain.com',  -- Replace with your email      
    'DMF Error Notification',      
    'The Data Metric Function "INVALID_ORDER_STATUS_COUNT" has reported errors. Please check the following: ' ||      
    'Invalid Order Status Count: ' || :INVALID_ORDER_STATUS_COUNT_RESULT    
 	);  
 END IF;  
 
RETURN 'Notification check completed.';
 
END;
$$;

Then, we create a recurring task that runs this procedure once per day:

CREATE OR REPLACE TASK NOTIFY_ON_INVALID_ORDER_STATUS  
 WAREHOUSE = 'COMPUTE_WH'  -- Replace with your warehouse name  
 SCHEDULE = 'USING CRON 0 0 * * * UTC'  -- Run once per day at midnight UTC
AS
 CALL CHECK_INVALID_ORDER_STATUS_AND_NOTIFY();  
  
// Ensure the task is active
ALTER TASK NOTIFY_ON_INVALID_ORDER_STATUS RESUME;

For the sake of the test, we can also run the check immediately, without waiting for the next scheduled run, by calling the procedure directly:

CALL CHECK_INVALID_ORDER_STATUS_AND_NOTIFY();

At this point, check that you’ve received the notification email. It may take a few minutes for the email to reach your inbox. Once it arrives, the test is officially done!

Step 8: Clean Up

Now that we are done with the test, let’s clean up all of the objects we created exclusively for this tutorial:

// Verify we are on the right database (to avoid accidentally dropping real data)
USE DATABASE DMF_TESTING;
USE SCHEMA TEMP;
 
// Detach DMFs from the tables
ALTER TABLE customers  
 DROP DATA METRIC FUNCTION INVALID_EMAIL_COUNT  
 ON (email);
ALTER TABLE customers  
 DROP DATA METRIC FUNCTION INVALID_COUNTRY_FORMAT_COUNT  
 ON (country);
ALTER TABLE orders  
 DROP DATA METRIC FUNCTION INVALID_ORDER_STATUS_COUNT  
 ON (order_status);
  
// Remove the DMF definitions
DROP FUNCTION IF EXISTS INVALID_EMAIL_COUNT(
 TABLE(    
  STRING 
 )
);
DROP FUNCTION IF EXISTS INVALID_COUNTRY_FORMAT_COUNT(  
 TABLE(    
  STRING  
 )
);
DROP FUNCTION IF EXISTS INVALID_ORDER_STATUS_COUNT(  
 TABLE(    
  STRING  
 )
);

// Drop the tables
DROP TABLE IF EXISTS customers;
DROP TABLE IF EXISTS orders;

// Drop the task
DROP TASK IF EXISTS NOTIFY_ON_INVALID_ORDER_STATUS;

// Remove the stored function (that is used in the task)
DROP PROCEDURE IF EXISTS CHECK_INVALID_ORDER_STATUS_AND_NOTIFY();

// Drop the testing database
DROP DATABASE IF EXISTS DMF_TESTING;

// Delete email integration
DROP NOTIFICATION INTEGRATION IF EXISTS my_email_int;

What’s Next?

In this article we learned how to configure DMFs to validate data in Snowflake, attach them to tables so that they run on every change, view validation results, and set up email notifications to alert on validation errors. With this knowledge, you can now go ahead and define DMFs for all the important tables in your Snowflake warehouse, making sure data invariants and data contracts are validated and enforced.

However, this process can be quite tedious, especially when you have hundreds or thousands of tables, each with dozens of columns and a large number of data validation rules.

Fortunately, much of this work can be automated, since most rules can be inferred automatically by analyzing the SQL code executed in the warehouse and making assertions based on the actual queries we observe. For example, if a specific column is always compared against a small set of values, we can assume it is an enum and create an appropriate DMF to validate it.

To make this easier, at Foundational we created a Free DMF Generator, which automatically generates DMF rules based solely on a sample SQL query. We also invite you to schedule time with us to see how Foundational can automate and keep DMFs up to date for you, so that you’d never have to worry about Snowflake data not being validated. 
