[Note: This post also appeared in the OpenLineage blog]
Data lineage is the cornerstone of modern data governance, providing transparency, traceability, and accountability across the data lifecycle. It ensures that organizations can track how data flows through systems, transforms between processes, and ultimately impacts downstream analytics and decision-making. For data leaders, lineage is critical to maintaining trust in data by ensuring its accuracy, managing risks, and complying with regulatory requirements.
Beyond its role in compliance and security, lineage is essential for operational efficiency. It allows teams to perform impact analyses before making changes, reducing the risk of breaking pipelines or disrupting critical workflows. When combined with metadata and integrated into governance tools, lineage offers a powerful way to visualize data dependencies, troubleshoot issues, and ensure stewardship across the organization.
OpenLineage, the leading standard for capturing and sharing lineage, has transformed how organizations manage runtime-based lineage emitted from tools like Airflow and Spark. Yet this focus on runtime lineage leaves a gap when it comes to static code, which also governs pipelines that run rarely or only ad hoc. Addressing this gap is essential to achieving a comprehensive view of lineage across the data ecosystem.
This blog explores why lineage extracted from code is indispensable, how Foundational extracts lineage directly from code, the challenges of integrating it into OpenLineage, and how a community-driven approach can address these challenges to provide a holistic view of data lineage.
OpenLineage has excelled at capturing lineage during pipeline execution. Whether tracking Spark transformations or Airflow DAGs, its runtime-centric approach has provided unmatched visibility into active pipelines. However, relying solely on runtime lineage creates blind spots, particularly for rarely executed or complex code paths.
In these scenarios, runtime lineage cannot provide the full picture. To fill this gap, lineage extracted directly from code becomes critical. Together, runtime and code-based lineage create a comprehensive view, ensuring organizations are fully informed regardless of pipeline execution frequency.
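As a rough illustration of how the two sources complement each other, the sketch below merges a set of runtime-observed edges with a set of code-derived edges and labels where each edge came from. The table names and the `merge_lineage` helper are hypothetical and exist only for this example.

```python
# Hypothetical lineage edges, written as (source, target) pairs.
runtime_edges = {
    ("raw.orders", "staging.orders"),
}
code_edges = {
    ("raw.orders", "staging.orders"),
    ("staging.orders", "finance.year_end_report"),  # pipeline that rarely runs
}


def merge_lineage(runtime: set, code: set) -> dict:
    """Combine both edge sets and record which source reported each edge."""
    merged = {}
    for edge in runtime | code:
        if edge in runtime and edge in code:
            merged[edge] = "runtime+code"
        elif edge in runtime:
            merged[edge] = "runtime-only"
        else:
            merged[edge] = "code-only"
    return merged


merged = merge_lineage(runtime_edges, code_edges)
for (source, target), origin in sorted(merged.items()):
    print(f"{source} -> {target}: {origin}")
```

Edges labeled `code-only` are the ones a purely runtime view would miss until the corresponding pipeline actually executes.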
Extracting lineage from code is a non-trivial process because it requires static analysis to understand the data flow and dependencies within the codebase, without running it. Foundational provides a solution for extracting lineage from code for platforms such as dbt and Spark, ORMs such as SQLAlchemy, and others, by analyzing the code to identify how data is transformed and moved between different components.
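To make the idea of static extraction concrete, here is a deliberately simplified sketch that derives table-level edges from a SQL string without executing it. It is not Foundational's method: the regular expressions and the `extract_edges` helper are assumptions made for illustration, and real-world SQL (CTEs, subqueries, quoted identifiers, dialect differences) needs a proper parser.

```python
import re

# Illustrative patterns only; they cover plain INSERT INTO / CREATE TABLE
# targets and FROM / JOIN sources, nothing more.
TARGET_RE = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_edges(sql: str) -> set[tuple[str, str]]:
    """Return (source, target) table pairs found in a single SQL statement."""
    targets = TARGET_RE.findall(sql)
    sources = SOURCE_RE.findall(sql)
    return {(source, target) for target in targets for source in sources}


copy_edges = extract_edges("INSERT INTO Table2 SELECT * FROM Table1")
join_edges = extract_edges(
    "CREATE TABLE c AS SELECT * FROM a JOIN b ON a.id = b.id"
)
print(copy_edges)
print(join_edges)
```

The statement is only read as text, never run, which is what lets this kind of analysis cover code paths that rarely or never execute.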
Code-based lineage, like that implemented by Foundational, complements runtime lineage by providing coverage for potential lineage—a view of what could happen when the code is executed. This distinction is invaluable for several high-stakes use cases:
OpenLineage’s current model revolves around the concept of a Job, representing a runtime activity that transforms one Entity (e.g., a table or file) into another. But what happens when there is no true runtime job—when lineage comes from static code analysis instead?
In July 2023, OpenLineage introduced the concept of Static Lineage, which makes it possible to model lineage that is not emitted at runtime but is declared statically, as “design” lineage. We leverage this to model code-based lineage: the Job object represents the code location from which the lineage was extracted. This aligns well with the OpenLineage model, in which a Job defines a transformation and a Run is an instance of that Job, so code-based lineage can use the Job object without a Run object.
We can also use Job facets that suit code-based lineage, such as SourceCodeLocationJobFacet, to carry additional information such as the repository and the specific code version identifier (e.g., commit hash).
So, for example, for a piece of code that copies data from Table1 to Table2, the lineage would be:
Table1 → <Job (points to source/foo.py)> → Table2
This approach maintains compatibility with the existing OpenLineage model while providing a path forward for integrating code-based lineage.
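The Table1 to Table2 example can be sketched as a run-less event payload built with plain Python dictionaries. This is a sketch under assumptions, not a reference implementation: the field names reflect our reading of the OpenLineage JobEvent and SourceCodeLocationJobFacet definitions and should be checked against the spec version you target. The producer URI, schema URLs, repository URL, namespaces, and commit hash below are all placeholders, and `build_static_job_event` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone

# Placeholders: replace with your producer URI and the schema URLs published
# for the OpenLineage spec version you target.
PRODUCER = "https://example.com/code-lineage-extractor"
JOB_EVENT_SCHEMA_URL = "https://example.com/REPLACE-with-JobEvent-schema-url"
FACET_SCHEMA_URL = "https://example.com/REPLACE-with-facet-schema-url"


def dataset(namespace: str, name: str) -> dict:
    """Minimal dataset reference: namespace plus name."""
    return {"namespace": namespace, "name": name}


def build_static_job_event(repo_url, path, commit, inputs, outputs) -> dict:
    """Build a job-level event with no 'run' key (design-time lineage)."""
    return {
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": JOB_EVENT_SCHEMA_URL,
        "job": {
            "namespace": "code-lineage",  # hypothetical namespace
            "name": path,                 # the job points at the code location
            "facets": {
                "sourceCodeLocation": {
                    "_producer": PRODUCER,
                    "_schemaURL": FACET_SCHEMA_URL,
                    "type": "git",
                    # URL layout is an assumption; adapt to your code host.
                    "url": f"{repo_url}/blob/{commit}/{path}",
                    "repoUrl": repo_url,
                    "path": path,
                    "version": commit,
                }
            },
        },
        "inputs": list(inputs),
        "outputs": list(outputs),
    }


event = build_static_job_event(
    repo_url="https://example.com/acme/pipelines",  # placeholder repository
    path="source/foo.py",
    commit="abc1234",                               # placeholder commit hash
    inputs=[dataset("warehouse", "Table1")],
    outputs=[dataset("warehouse", "Table2")],
)
print(json.dumps(event, indent=2))
```

Because the payload carries a Job but no Run, a consumer can treat it as design-time lineage and still link it to runtime events that later reference the same Job.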
While modeling via the Job object is a functional starting point, it leaves room for improvement. For example:
Foundational is excited to collaborate with the OpenLineage community to refine this model and develop a standard that unites code-based and runtime lineage into a cohesive framework.
Data lineage is no longer just a nice-to-have; it’s a requirement for ensuring trust, compliance, and security in the modern data stack. OpenLineage has laid the foundation for open, runtime-based lineage, but it’s time to expand the scope.
By integrating code-based lineage, organizations can achieve full coverage of their data pipelines, capturing both what is happening and what could happen. This comprehensive approach unlocks new possibilities for compliance, security, and data engineering efficiency.
Foundational is already helping organizations extract lineage directly from their codebases, and we look forward to collaborating with the OpenLineage community to ensure this new frontier of lineage is modeled effectively. Together, we can build a lineage ecosystem that leaves no pipeline—or dependency—untracked.