Despite the seemingly broad range of solutions, data lineage remains a hot topic and an ongoing challenge for many data organizations. There are multiple reasons for this: the wide diversity of platforms in use, increasing query complexity, the upfront investment needed to set everything up, and the ongoing effort required to keep things up to date. Ultimately, all of this suggests that, at least in the enterprise, data lineage is still a hard technological problem. In this article we'll explain why that's the case.
Data systems change constantly: new pipelines and occasionally new tooling are introduced, data assets are removed, schemas evolve, and dashboards and tables are updated. All of this means data lineage solutions need to be refreshed on a regular basis.
Traditionally, and this is still commonly the case in the enterprise today, the core tool for data lineage, namely the data catalog, does not really update on its own. It needs to be integrated into the change management process of everything involving data, and that is quite hard to achieve.
Some data catalogs, primarily the newer, startup-built solutions, do provide more automation, mainly for the cloud data warehouse, thanks to the availability of query logs, which can be parsed and analyzed to extract data lineage. In terms of language, these logs are mostly SQL, which fortunately is the data querying language with the most parsing options available, open source in particular.
Lastly, for non-code-based tools such as Business Intelligence products (e.g., Tableau), automated data lineage relies on vendor APIs to query for information such as schemas, dashboard definitions, and dependency mappings, and combines that information with the rest of the lineage graph.
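To make this concrete, here is a minimal sketch of API-based lineage extraction for a BI tool, using Tableau's Metadata API (a GraphQL endpoint) as the example. The server URL and auth token are placeholders, and the exact fields available depend on the Tableau version:

```python
import requests

TABLEAU_SERVER = "https://tableau.example.com"  # placeholder
AUTH_TOKEN = "<session-token-from-sign-in>"     # placeholder

# GraphQL query against Tableau's Metadata API: for each workbook,
# list the warehouse tables it reads from, i.e., its upstream lineage.
QUERY = "{ workbooks { name upstreamTables { name } } }"

resp = requests.post(
    f"{TABLEAU_SERVER}/api/metadata/graphql",
    json={"query": QUERY},
    headers={"X-Tableau-Auth": AUTH_TOKEN},
)
resp.raise_for_status()

for workbook in resp.json()["data"]["workbooks"]:
    tables = sorted(t["name"] for t in workbook["upstreamTables"])
    print(workbook["name"], "<-", tables)
```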
When it comes to the data warehouse, given the increasing popularity of warehouse-centric data architectures and specifically the vast popularity of Snowflake and BigQuery, getting basic lineage for the data warehouse has become a commodity.
Most of these solutions work through SQL parsing, with many of the parsers written individually by each solution vendor. Most solutions rely on the availability of query logs, and on the ability to be set up as a limited user with access to those logs.
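As an illustration, here is a minimal sketch of this approach using the open-source sqlglot parser: it takes a CTAS statement, as might appear in a query log, and derives table-level lineage from it. The query and table names are illustrative:

```python
import sqlglot
from sqlglot import exp

def qualified(table: exp.Table) -> str:
    # Assemble catalog.schema.table from whichever parts are present.
    return ".".join(p for p in (table.catalog, table.db, table.name) if p)

query = """
CREATE TABLE analytics.daily_revenue AS
SELECT o.order_date, SUM(o.amount) AS revenue
FROM raw.orders AS o
JOIN raw.customers AS c ON o.customer_id = c.id
GROUP BY o.order_date
"""

ast = sqlglot.parse_one(query)

# The created table is the lineage target; every table referenced in the
# CTAS SELECT is an upstream source.
target = qualified(ast.this)
sources = sorted({qualified(t) for t in ast.expression.find_all(exp.Table)})
print(f"{target} <- {sources}")
# analytics.daily_revenue <- ['raw.customers', 'raw.orders']
```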
On the surface it may sound like, for the case of a data warehouse, the problem is pretty much solved. Sometimes that is indeed the case. The challenges appear when requirements go beyond that baseline, such as moving to a Spark-based Data Lakehouse.
Unlike the data warehouse, where query logs are readily available as SQL, in a Spark-based Lakehouse there’s a need to parse languages such as Scala, Python and Java. This is where 95% of solutions fall short, since creating a robust parser for Spark-based pipelines is genuinely hard.
In some cases, Spark-based architectures may still rely on SQL, for example through SparkSQL, or the entire architecture may use SQL as the main language, as in the case of Dremio. In these cases one might expect solutions to be more mature and easier to deploy; in practice, that is not the case.
Ultimately, there are two reliable options for the Data Lakehouse today.
That said, everything about Spark- and Lakehouse-based lineage is still a lot harder, as many teams have discovered when trying to implement it themselves.
Ultimately, it seems that most catalogs have resorted to either ingesting OpenLineage events (which makes a lot of sense) or not supporting Spark-based lineage at all.
Source-code-based data lineage is an emerging technology which has really come to the fore with the appearance of tools such as dbt and Databricks, which rely on git for version control and code management. With all the relevant code present in git, it suddenly becomes feasible to parse the source code directly. And with the introduction of AI workflows trained for code analysis and code generation, doing so at scale has become significantly more practical.
However, directly parsing source code is still a challenge, which, depending on the architecture and tools in use, may require SQL, Python, Scala, or other types of parsing to extract the exact dependencies accurately. The advantage of source-code-based data lineage is zero lag and very little integration effort, since all the code is readily available in cloud-based git tools such as GitHub and GitLab. Ultimately, source-code-based data lineage can serve as a new type of signal for data catalogs, which could ingest it to automate the parts of the data lineage graph that are not easily accessible.
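As a minimal sketch of the idea, the following uses Python's built-in ast module to statically scan a (hypothetical) PySpark job for spark.read.table(...) inputs and .saveAsTable(...) outputs; a production implementation would need to handle far more patterns:

```python
import ast

# A toy PySpark job as a string; in practice this would be read from a
# file in the git repository.
SOURCE = '''
df = spark.read.table("raw.orders")
dim = spark.read.table("raw.customers")
result = df.join(dim, "customer_id")
result.write.mode("overwrite").saveAsTable("analytics.order_facts")
'''

inputs, outputs = set(), set()
for node in ast.walk(ast.parse(SOURCE)):
    # Look for method calls whose first argument is a string literal.
    if (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.args
        and isinstance(node.args[0], ast.Constant)
        and isinstance(node.args[0].value, str)
    ):
        if node.func.attr == "table":
            inputs.add(node.args[0].value)
        elif node.func.attr == "saveAsTable":
            outputs.add(node.args[0].value)

print("inputs:", sorted(inputs))    # ['raw.customers', 'raw.orders']
print("outputs:", sorted(outputs))  # ['analytics.order_facts']
```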
OpenLineage is an open standard for data lineage tracking, designed both to make collecting lineage information easier and, perhaps more importantly, to foster interoperability between various data tools and platforms. If we mentioned before that most data lineage tools do not exchange lineage information, with OpenLineage this could very well change.
OpenLineage provides a specification for capturing and conveying metadata about data processes, including the relationships between datasets and the transformations they undergo. By using a common format for representing lineage information, compatible tools can potentially exchange lineage information, facilitating more comprehensive visibility into data flows. For a large enterprise, which may have an enterprise catalog solution such as Purview or Unity Catalog, working with the existing catalog is a must-have.
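For illustration, here is roughly what a run event looks like when assembled in Python, using field names from the OpenLineage spec (eventType, run, job, inputs, outputs); the job, namespace, and dataset names are illustrative, and the event is printed rather than sent to a consumer:

```python
import json
import uuid
from datetime import datetime, timezone

# A producer-side RunEvent marking a job run as complete, with one input
# and one output dataset. Names and namespaces are illustrative.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "prod_pipelines", "name": "daily_revenue_job"},
    "inputs": [{"namespace": "snowflake://acme", "name": "raw.orders"}],
    "outputs": [{"namespace": "snowflake://acme", "name": "analytics.daily_revenue"}],
    "producer": "https://example.com/lineage-agent",  # identifies the emitting tool
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
}

# A consumer (typically a data catalog) would receive this via HTTP POST;
# here we just print it.
print(json.dumps(event, indent=2))
```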
Interestingly, OpenLineage has really matured in two areas that data catalogs have traditionally struggled with: Spark-based pipelines, and Airflow-based processes. With OpenLineage in place, extracting lineage for these technologies is a lot more pragmatic; however, it still requires code changes, deployment, and adoption across the company's data developers, to ensure that all new pipelines include the OpenLineage functionality.
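As a sketch of what those code changes look like in practice, the following enables the OpenLineage Spark connector from PySpark by registering its listener; the connector coordinates, endpoint URL, and namespace here are illustrative and should be checked against the OpenLineage documentation for your Spark version:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily_revenue_job")
    # Pull in the OpenLineage Spark connector (version is illustrative).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.9.1")
    # Register the listener that emits a lineage event per Spark action.
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Where to send events (e.g., a data catalog acting as a consumer).
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://lineage-backend:5000")
    .config("spark.openlineage.namespace", "prod_pipelines")
    .getOrCreate()
)
```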
Lastly, there’s a distinction in OpenLineage between consumers, who can accept lineage data, and producers, who can send it. OpenLineage consumers, typically data catalogs, can accept OpenLineage events that relay lineage information. Similarly, OpenLineage producers, such as the Spark connector, can emit lineage information to any other tool or product that may leverage it.
There are several key criteria by which lineage solutions should be properly evaluated.
Lineage remains a complex yet very dynamic topic of constant innovation. It’s also exciting to see new solutions emerge, both from a technology perspective and from a process perspective, for example in the ability to exchange lineage information. Still, data lineage remains a very painful area for the enterprise.
Contact us to learn more about data lineage for your data platform
At Foundational, we believe that data lineage should be automated. By leveraging our proprietary technology, we help data organizations gain visibility into, and a better understanding of, complex data environments, from the upstream sources all the way to downstream consumption. Schedule time with us to learn more.