Despite the seemingly broad range of solutions, data lineage remains a hot topic and an ongoing challenge for many data organizations. There are multiple reasons for this: the wide diversity of platforms in use, increasing query complexity, the upfront investment needed to set everything up, and the ongoing effort required to keep things up to date. Ultimately, all of this suggests that, at least in the enterprise, data lineage is still a hard technological problem. In this article we'll explain why that's the case.
Data systems change constantly: new pipelines and occasionally new tooling are introduced, data assets are removed, schemas evolve, and dashboards and tables are updated. All of this means data lineage solutions need to be refreshed on a regular basis.
Traditionally, and this is still commonly the case in the enterprise today, the core tool for data lineage, namely the data catalog, does not really update on its own. It needs to be integrated into the change management process of everything involving data, and that is quite hard to achieve.
Some data catalogs, primarily the newer, startup-built solutions, do provide more automation, mainly for the cloud data warehouse, thanks to the availability of query logs, which can be parsed and analyzed to extract data lineage. In terms of language, these logs are mostly SQL, which fortunately is the data querying language with the most parsing options available, open source in particular.
Lastly, for non-code-based tools such as Business Intelligence products (e.g., Tableau), automated data lineage relies on vendor APIs to query for information such as schemas, dashboard definitions, and dependency mappings, and combines that information with the rest of the lineage graph.
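To make this concrete, here is a minimal sketch of API-based lineage extraction for a BI tool, using Tableau's Metadata API (a GraphQL endpoint) as the example. The server URL and auth token are placeholders, and the exact fields available depend on the Tableau version:

```python
import requests

TABLEAU_SERVER = "https://tableau.example.com"  # placeholder
AUTH_TOKEN = "<session-token-from-sign-in>"     # placeholder

# GraphQL query against Tableau's Metadata API: for each workbook,
# list the warehouse tables it reads from, i.e., its upstream lineage.
QUERY = "{ workbooks { name upstreamTables { name } } }"

resp = requests.post(
    f"{TABLEAU_SERVER}/api/metadata/graphql",
    json={"query": QUERY},
    headers={"X-Tableau-Auth": AUTH_TOKEN},
)
resp.raise_for_status()

for workbook in resp.json()["data"]["workbooks"]:
    tables = sorted(t["name"] for t in workbook["upstreamTables"])
    print(workbook["name"], "<-", tables)
```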
When it comes to the data warehouse, given the increasing popularity of warehouse-centric data architectures and specifically the vast popularity of Snowflake and BigQuery, getting basic lineage for the data warehouse has become a commodity.
Most of these solutions work through SQL parsing, with many of the parsers written individually by each solution vendor. Most solutions rely on the availability of query logs, and on the ability to be set up as a limited user with access to those logs.
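As an illustration, here is a minimal sketch of this approach using the open-source sqlglot parser: it takes a CTAS statement, as might appear in a query log, and derives table-level lineage from it. The query and table names are illustrative:

```python
import sqlglot
from sqlglot import exp

def qualified(table: exp.Table) -> str:
    # Assemble catalog.schema.table from whichever parts are present.
    return ".".join(p for p in (table.catalog, table.db, table.name) if p)

query = """
CREATE TABLE analytics.daily_revenue AS
SELECT o.order_date, SUM(o.amount) AS revenue
FROM raw.orders AS o
JOIN raw.customers AS c ON o.customer_id = c.id
GROUP BY o.order_date
"""

ast = sqlglot.parse_one(query)

# The created table is the lineage target; every table referenced in the
# CTAS SELECT is an upstream source.
target = qualified(ast.this)
sources = sorted({qualified(t) for t in ast.expression.find_all(exp.Table)})
print(f"{target} <- {sources}")
# analytics.daily_revenue <- ['raw.customers', 'raw.orders']
```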
On the surface it may sound like, for the case of a data warehouse, the problem is pretty much solved. Sometimes that is indeed the case. The challenges appear when requirements go beyond that baseline, such as moving to a Spark-based Data Lakehouse.
Unlike the data warehouse, where query logs are readily available as SQL, in a Spark-based Lakehouse there’s a need to parse languages such as Scala, Python and Java. This is where 95% of solutions fall short, since creating a robust parser for Spark-based pipelines is genuinely hard.
In some cases, Spark-based architectures may still rely on SQL, for example through SparkSQL, or the entire architecture may use SQL as the main language, as in the case of Dremio. In these cases one might expect solutions to be more mature and easier to deploy; in practice, that is not the case.
Ultimately, there are two reliable options for the Data Lakehouse today.
That said, everything about Spark- and Lakehouse-based lineage is still a lot harder, as many teams have discovered when trying to implement it themselves.
Ultimately, it seems that most catalogs have resorted to either ingesting OpenLineage events (which makes a lot of sense) or not supporting Spark-based lineage at all.
Source-code-based data lineage is an emerging technology which has really come to the fore with the appearance of tools such as dbt and Databricks, which rely on git for version control and code management. With all the relevant code present in git, it suddenly becomes feasible to parse the source code directly. And with the introduction of AI workflows trained for code analysis and code generation, doing so at scale has become significantly more practical.
However, directly parsing source code is still a challenge, which, depending on the architecture and tools in use, may require SQL, Python, Scala, or other types of parsing to extract the exact dependencies accurately. The advantage of source-code-based data lineage is zero lag and very little integration effort, since all the code is readily available in cloud-based git tools such as GitHub and GitLab. Ultimately, source-code-based data lineage can serve as a new type of signal for data catalogs, which could ingest it to automate the parts of the data lineage graph that are not easily accessible.
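As a minimal sketch of the idea, the following uses Python's built-in ast module to statically scan a (hypothetical) PySpark job for spark.read.table(...) inputs and .saveAsTable(...) outputs; a production implementation would need to handle far more patterns:

```python
import ast

# A toy PySpark job as a string; in practice this would be read from a
# file in the git repository.
SOURCE = '''
df = spark.read.table("raw.orders")
dim = spark.read.table("raw.customers")
result = df.join(dim, "customer_id")
result.write.mode("overwrite").saveAsTable("analytics.order_facts")
'''

inputs, outputs = set(), set()
for node in ast.walk(ast.parse(SOURCE)):
    # Look for method calls whose first argument is a string literal.
    if (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.args
        and isinstance(node.args[0], ast.Constant)
        and isinstance(node.args[0].value, str)
    ):
        if node.func.attr == "table":
            inputs.add(node.args[0].value)
        elif node.func.attr == "saveAsTable":
            outputs.add(node.args[0].value)

print("inputs:", sorted(inputs))    # ['raw.customers', 'raw.orders']
print("outputs:", sorted(outputs))  # ['analytics.order_facts']
```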
OpenLineage is an open standard for data lineage tracking, designed both to make collecting lineage information easier and, perhaps more importantly, to foster interoperability between various data tools and platforms. If we mentioned before that most data lineage tools do not exchange lineage information, with OpenLineage this could very well change.
OpenLineage provides a specification for capturing and conveying metadata about data processes, including the relationships between datasets and the transformations they undergo. By using a common format for representing lineage information, compatible tools can potentially exchange lineage information, facilitating more comprehensive visibility into data flows. For a large enterprise, which may have an enterprise catalog solution such as Purview or Unity Catalog, working with the existing catalog is a must-have.
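For illustration, here is roughly what a run event looks like when assembled in Python, using field names from the OpenLineage spec (eventType, run, job, inputs, outputs); the job, namespace, and dataset names are illustrative, and the event is printed rather than sent to a consumer:

```python
import json
import uuid
from datetime import datetime, timezone

# A producer-side RunEvent marking a job run as complete, with one input
# and one output dataset. Names and namespaces are illustrative.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "prod_pipelines", "name": "daily_revenue_job"},
    "inputs": [{"namespace": "snowflake://acme", "name": "raw.orders"}],
    "outputs": [{"namespace": "snowflake://acme", "name": "analytics.daily_revenue"}],
    "producer": "https://example.com/lineage-agent",  # identifies the emitting tool
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
}

# A consumer (typically a data catalog) would receive this via HTTP POST;
# here we just print it.
print(json.dumps(event, indent=2))
```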
Interestingly, OpenLineage has really matured in two areas that data catalogs have traditionally struggled with: Spark-based pipelines, and Airflow-based processes. With OpenLineage in place, extracting lineage for these technologies is a lot more pragmatic; however, it still requires code changes, deployment, and adoption across the company's data developers, to ensure that all new pipelines include the OpenLineage functionality.
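As a sketch of what those code changes look like in practice, the following enables the OpenLineage Spark connector from PySpark by registering its listener; the connector coordinates, endpoint URL, and namespace here are illustrative and should be checked against the OpenLineage documentation for your Spark version:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily_revenue_job")
    # Pull in the OpenLineage Spark connector (version is illustrative).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.9.1")
    # Register the listener that emits a lineage event per Spark action.
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Where to send events (e.g., a data catalog acting as a consumer).
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://lineage-backend:5000")
    .config("spark.openlineage.namespace", "prod_pipelines")
    .getOrCreate()
)
```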
Lastly, there’s a distinction in OpenLineage between consumers, who can accept lineage data, and producers, who can send it. OpenLineage consumers, typically data catalogs, can accept OpenLineage events that relay lineage information. Similarly, OpenLineage producers, such as the Spark connector, can emit lineage information to any other tool or product that may leverage it.
There are several key criteria by which lineage solutions should be properly evaluated.
Lineage remains a complex yet very dynamic topic of constant innovation. It’s also exciting to see new solutions emerge, both from a technology perspective and from a process perspective, for example in the ability to exchange lineage information. Still, data lineage remains a very painful area for the enterprise.
Contact us to learn more about data lineage for your data platform
At Foundational, we believe that data lineage should be automated. By leveraging our proprietary technology, we help data organizations gain visibility into, and a better understanding of, complex data environments, from the upstream sources all the way to downstream consumption. Schedule time with us to learn more.