Our goal at Foundational is to validate every code change that may adversely affect data, before it actually does. To do this successfully for data engineering code, for example in a framework such as dbt, we need to maintain an accurate data lineage graph and understand, every time new code is committed, what lineage impact that commit has.
Why is this important? Consider one type of issue that is quite common in data, called a field type mismatch. Here, a field in a table depends on upstream fields of the same type - for example, Boolean. If the type of one of these upstream fields is changed without changing the type of all the dependent fields, we may encounter a problem (which quite often will not get flagged). To identify this before the new code is merged, we need to understand the lineage dependencies as well as the exact impact of the code change.
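As a minimal, illustrative sketch (the model and column names here are assumptions, not taken from a real project), consider a dbt-style pair of models where an upstream column silently changes from a Boolean to a string:

```sql
-- models/stg_users.sql (upstream): is_active used to be BOOLEAN,
-- but this change starts emitting it as a string
SELECT
  user_id,
  CASE WHEN status = 'active' THEN 'true' ELSE 'false' END AS is_active
FROM raw_users

-- models/active_users.sql (downstream): still treats is_active as BOOLEAN;
-- depending on the warehouse this may fail, or silently coerce and misfilter
SELECT user_id
FROM {{ ref('stg_users') }}
WHERE is_active
```

A lineage-aware check can flag that [.code]is_active[.code] changed type upstream while the downstream filter still expects a Boolean.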
Unfortunately, this type of analysis is not available in standard lineage information offered by the platforms or even data catalogs, hence the need for code validation such as the one we’re creating at Foundational.
In this blog post we’ll focus on SQL, but very similar problems exist for other languages as well. Let’s elaborate on some of these problems:
Consider the following query, in which the columns in the SELECT list are not qualified by a table name (the snippet is an illustrative example of the pattern):
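```sql
-- Illustrative: a and b appear without a table qualifier in the SELECT list
SELECT a, b
FROM table1, table2
WHERE table1.a = table2.id
  AND table1.b > 0
```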
Analyzing this query is straightforward when schema information is available, but with only the source code, the query analyzer cannot determine which column belongs to which table. Looking at the SELECT list alone, each of the following schemas could be valid:
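```sql
-- Illustrative candidate schemas (column types are placeholders)

-- Option 1: both columns live in table1
CREATE TABLE table1 (a INT, b INT);
CREATE TABLE table2 (id INT);

-- Option 2: the columns are split across the two tables
CREATE TABLE table1 (a INT);
CREATE TABLE table2 (id INT, b INT);

-- Option 3: the split goes the other way
CREATE TABLE table1 (b INT);
CREATE TABLE table2 (id INT, a INT);
```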
How can we then determine which one it is for an actual query? Here are two methods that can be used:
The first is to use the rest of the query itself: in this case, the WHERE clause ties [.code]a[.code] and [.code]b[.code] to [.code]table1[.code], so we can determine that [.code]table1[.code] is the table that has these columns.
The second is to bring in more information: ambiguity can generally be resolved by analyzing a larger set of queries together, for example an entire dbt project, and cross-referencing how the same tables and columns are used elsewhere - all while still relying on source code only.
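For instance (an illustrative sketch, not from a real project), another query in the same project might reference one of the columns with an explicit qualifier, which pins it to its table:

```sql
-- Elsewhere in the same project: b is referenced as table1.b,
-- so the analyzer can conclude that b belongs to table1
SELECT table1.b AS b
FROM table1
WHERE table1.b IS NOT NULL
```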
While SQL is largely similar across platforms, there are numerous differences between the SQL “flavors” used by different data warehouses - for example between Redshift and Snowflake, or between Snowflake and BigQuery.
Different warehouses support different formats and functions, and syntax that one warehouse accepts may be rejected, or behave differently, in another. These differences can themselves cause code issues that lead to data incidents or performance impact.
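As a small illustration (the table and column names are assumptions; exact signatures should be checked against each warehouse’s documentation), the same transformation is spelled differently in Snowflake or Redshift than in BigQuery:

```sql
-- Snowflake / Redshift style: LISTAGG and DATEADD
SELECT
  customer_id,
  LISTAGG(product_name, ', ') AS products,
  DATEADD(day, 30, MIN(order_date)) AS follow_up_date
FROM orders
GROUP BY customer_id;

-- BigQuery style: STRING_AGG and DATE_ADD with an INTERVAL
SELECT
  customer_id,
  STRING_AGG(product_name, ', ') AS products,
  DATE_ADD(MIN(order_date), INTERVAL 30 DAY) AS follow_up_date
FROM orders
GROUP BY customer_id;
```

A parser that only understands one dialect will either reject the other form or, worse, mis-resolve the lineage of the derived columns.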
Unfortunately, open-source parsers do not seem to handle these differences well enough, so we had to invest significant effort in making warehouse-specific adjustments to our parsers.
Overall, SQL that uses advanced functions is harder to analyze, and we could not rely on open-source parsers, which generally do not support these constructs well. For example, structs are not well supported, and CASE … WHEN statements are also harder to analyze.
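For example, a BigQuery-style model like the sketch below (table and field names are illustrative) derives output columns from a nested struct field and from a CASE … WHEN expression, and the analyzer has to trace both back to the right inputs:

```sql
SELECT
  -- the output column depends on a nested field, not a top-level column
  payload.user_id AS user_id,
  -- order_tier depends on payload.amount through the CASE ... WHEN branches
  CASE
    WHEN payload.amount >= 100 THEN 'high'
    WHEN payload.amount >= 10  THEN 'medium'
    ELSE 'low'
  END AS order_tier
FROM events
```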
Another example is dynamic constructs such as PIVOT, whose output columns depend on the actual data, which we do not access. The approach to handling these, similar to handling ambiguities, is to look for references to the generated columns in subsequent code.
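A Snowflake-style sketch (table, column, and value names are assumptions): the columns produced by the pivot depend on the values in the data, but a later query that names some of them gives the analyzer something to anchor lineage on:

```sql
-- The pivot creates one column per distinct value of month, which cannot
-- be determined from the source code alone
CREATE TABLE sales_by_month AS
SELECT *
FROM raw_sales
  PIVOT (SUM(amount) FOR month IN (ANY));

-- Subsequent code that references the generated columns lets the analyzer
-- recover at least the columns it can see being used
SELECT customer_id, "JAN", "FEB"
FROM sales_by_month;
```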
At Foundational, we are helping data teams deploy code changes with confidence. Understanding the full impact on lineage is only one aspect of what we do – connect with us to learn more.