Our goal at Foundational is to validate every code change that may adversely affect data, before it actually does. To do this successfully for data engineering code, for example in a framework such as dbt, we need to maintain an accurate data lineage graph and understand, every time new code is committed, what lineage impact that commit has.
Why is this important? Consider one type of issue that is quite common in data, called a field type mismatch. Here, a field in a table depends on upstream fields of the same type - for example, Boolean. If the type of one of these upstream fields is changed without changing the type of all the dependent fields, we may encounter a problem (which quite often will not get flagged). To identify this before the new code is merged, we need to understand the lineage dependencies as well as the exact impact of the code change.
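As a minimal, illustrative sketch (the model and column names here are assumptions, not taken from a real project), consider a dbt-style pair of models where an upstream column silently changes from a Boolean to a string:

```sql
-- models/stg_users.sql (upstream): is_active used to be BOOLEAN,
-- but this change starts emitting it as a string
SELECT
  user_id,
  CASE WHEN status = 'active' THEN 'true' ELSE 'false' END AS is_active
FROM raw_users

-- models/active_users.sql (downstream): still treats is_active as BOOLEAN;
-- depending on the warehouse this may fail, or silently coerce and misfilter
SELECT user_id
FROM {{ ref('stg_users') }}
WHERE is_active
```

A lineage-aware check can flag that [.code]is_active[.code] changed type upstream while the downstream filter still expects a Boolean.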
Unfortunately, this type of analysis is not available in standard lineage information offered by the platforms or even data catalogs, hence the need for code validation such as the one we’re creating at Foundational.
In this blog post we’ll focus on SQL, but very similar problems exist for other languages as well. Let’s elaborate on some of these problems:
Consider the following query, in which the columns in the SELECT list are not qualified by a table name (the snippet is an illustrative example of the pattern):
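```sql
-- Illustrative: a and b appear without a table qualifier in the SELECT list
SELECT a, b
FROM table1, table2
WHERE table1.a = table2.id
  AND table1.b > 0
```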
Analyzing this query is straightforward when schema information is available, but with only the source code, the query analyzer cannot determine which column belongs to which table. Looking at the SELECT list alone, each of the following schemas could be valid:
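```sql
-- Illustrative candidate schemas (column types are placeholders)

-- Option 1: both columns live in table1
CREATE TABLE table1 (a INT, b INT);
CREATE TABLE table2 (id INT);

-- Option 2: the columns are split across the two tables
CREATE TABLE table1 (a INT);
CREATE TABLE table2 (id INT, b INT);

-- Option 3: the split goes the other way
CREATE TABLE table1 (b INT);
CREATE TABLE table2 (id INT, a INT);
```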
How can we then determine which one it is for an actual query? Here are two methods that can be used:
The first is to use the rest of the query itself: in this case, the WHERE clause ties [.code]a[.code] and [.code]b[.code] to [.code]table1[.code], so we can determine that [.code]table1[.code] is the table that has these columns.
The second is to bring in more information: ambiguity can generally be resolved by analyzing a larger set of queries together, for example an entire dbt project, and cross-referencing how the same tables and columns are used elsewhere - all while still relying on source code only.
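For instance (an illustrative sketch, not from a real project), another query in the same project might reference one of the columns with an explicit qualifier, which pins it to its table:

```sql
-- Elsewhere in the same project: b is referenced as table1.b,
-- so the analyzer can conclude that b belongs to table1
SELECT table1.b AS b
FROM table1
WHERE table1.b IS NOT NULL
```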
While SQL is largely similar across platforms, there are numerous differences between the SQL “flavors” used by different data warehouses - for example between Redshift and Snowflake, or between Snowflake and BigQuery.
Different warehouses support different formats and functions, and syntax that one warehouse accepts may be rejected, or behave differently, in another. These differences can themselves cause code issues that lead to data incidents or performance impact.
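As a small illustration (the table and column names are assumptions; exact signatures should be checked against each warehouse’s documentation), the same transformation is spelled differently in Snowflake or Redshift than in BigQuery:

```sql
-- Snowflake / Redshift style: LISTAGG and DATEADD
SELECT
  customer_id,
  LISTAGG(product_name, ', ') AS products,
  DATEADD(day, 30, MIN(order_date)) AS follow_up_date
FROM orders
GROUP BY customer_id;

-- BigQuery style: STRING_AGG and DATE_ADD with an INTERVAL
SELECT
  customer_id,
  STRING_AGG(product_name, ', ') AS products,
  DATE_ADD(MIN(order_date), INTERVAL 30 DAY) AS follow_up_date
FROM orders
GROUP BY customer_id;
```

A parser that only understands one dialect will either reject the other form or, worse, mis-resolve the lineage of the derived columns.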
Unfortunately, open-source parsers do not seem to handle these differences well enough, so we had to invest significant effort in making warehouse-specific adjustments to our parsers.
Overall, SQL that uses advanced functions is harder to analyze, and we could not rely on open-source parsers, which generally do not support these constructs well. For example, structs are not well supported, and CASE … WHEN statements are also harder to analyze.
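For example, a BigQuery-style model like the sketch below (table and field names are illustrative) derives output columns from a nested struct field and from a CASE … WHEN expression, and the analyzer has to trace both back to the right inputs:

```sql
SELECT
  -- the output column depends on a nested field, not a top-level column
  payload.user_id AS user_id,
  -- order_tier depends on payload.amount through the CASE ... WHEN branches
  CASE
    WHEN payload.amount >= 100 THEN 'high'
    WHEN payload.amount >= 10  THEN 'medium'
    ELSE 'low'
  END AS order_tier
FROM events
```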
Another example is dynamic constructs such as PIVOT, whose output columns depend on the actual data, which we do not access. The approach to handling these, similar to handling ambiguities, is to look for references to the generated columns in subsequent code.
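A Snowflake-style sketch (table, column, and value names are assumptions): the columns produced by the pivot depend on the values in the data, but a later query that names some of them gives the analyzer something to anchor lineage on:

```sql
-- The pivot creates one column per distinct value of month, which cannot
-- be determined from the source code alone
CREATE TABLE sales_by_month AS
SELECT *
FROM raw_sales
  PIVOT (SUM(amount) FOR month IN (ANY));

-- Subsequent code that references the generated columns lets the analyzer
-- recover at least the columns it can see being used
SELECT customer_id, "JAN", "FEB"
FROM sales_by_month;
```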
At Foundational, we are helping data teams deploy code changes with confidence. Understanding the full impact on lineage is only one aspect of what we do – connect with us to learn more.