Object-Relational Mapping (ORM) is a technique that lets engineers query and manipulate data in a database using an object-oriented paradigm. When people refer to an ORM, they usually mean a specific library that implements this technique, built for the programming language the development team is using, for example Python. The ORM library wraps all the code needed to manipulate the data, so instead of writing SQL you interact with objects in the same language you're already using.
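To make this concrete, here is a minimal, self-contained sketch of the paradigm, assuming SQLAlchemy 1.4+; the Customer model and the in-memory SQLite database are illustrative, not from any particular codebase:

```python
# Minimal sketch of the ORM paradigm: queries are expressed against Python
# objects, and the library generates the SQL. The Customer model and the
# in-memory SQLite database are illustrative assumptions.
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

engine = create_engine("sqlite://")  # in-memory database for the example
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Customer(name="Ada"))
    session.commit()
    # Instead of writing "SELECT * FROM customers WHERE name = 'Ada'",
    # we query through the mapped class:
    ada = session.execute(select(Customer).where(Customer.name == "Ada")).scalar_one()
    print(ada.id, ada.name)
```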
Understanding ORMs is of key importance for data teams since, in many cases, this is where breaking schema changes originate. One such ORM framework is SQLAlchemy. Built for Python, SQLAlchemy gives developers easy-to-use Python objects while the underlying data is stored in a relational database, for example Postgres. Developers define their objects in Python using the SQLAlchemy framework, and SQLAlchemy translates those objects into the corresponding database schemas and tables.
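As an illustration of that translation, the snippet below defines a hypothetical Order model and asks SQLAlchemy to render the Postgres DDL it would emit for it (a sketch, assuming SQLAlchemy 1.4+; the model is invented for the example):

```python
# Sketch: a SQLAlchemy model and the Postgres schema it translates to.
from sqlalchemy import Column, Integer, String
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import declarative_base
from sqlalchemy.schema import CreateTable

Base = declarative_base()

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    status = Column(String(32), nullable=False)
    total_cents = Column(Integer, nullable=False)

# Render the CREATE TABLE statement SQLAlchemy would emit for Postgres:
print(CreateTable(Order.__table__).compile(dialect=postgresql.dialect()))
# CREATE TABLE orders (
#     id SERIAL NOT NULL,
#     status VARCHAR(32) NOT NULL,
#     total_cents INTEGER NOT NULL,
#     PRIMARY KEY (id)
# )
```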
There are many reasons to use an ORM, but arguably the most important is consistency around data modeling: the modeling layer lives in a single place, which makes this critical piece far more maintainable and transparent. At the same time, and perhaps as a result, it is also the most common place where a simple schema change can do a lot of damage to downstream data systems.
When we ask the data practitioners we meet at Foundational about the common causes of data incidents and data quality issues in general, by far the most common answer we hear is upstream schema changes. "Upstream schema changes," or upstream changes more generally, do not always mean the same thing: for a BI person, it can be a change in the Snowflake tables maintained by data engineers. For the data engineering team, upstream can be a change in the data coming from Postgres, i.e., the operational database, or a change in a Salesforce object managed by a different team.
There is also a strong intuition for why this type of change is potentially the most devastating: the person making the change in their own project has no visibility into the downstream impact on other projects and parts of the tech stack. For engineering projects, software engineering teams sometimes address these cross-project issues by putting all projects in a single repository with a single build system, also known as a monorepo. But for the data stack this is simply not possible, since different areas of the stack are developed by different teams, typically across different organizations such as Data and Engineering. The result is that software engineers do their own testing in the software engineering repositories without knowing what is happening in the completely separate data engineering repositories. There is no single CI process, and cross-project dependencies are routinely ignored.
Of course, if engineers are aware that a schema change can break something downstream, they can communicate about it and find the relevant person to work with, but at a growing organization this is simply impractical and awfully time consuming. The way to streamline this is proper change management, where schema changes are detected automatically and communicated to the relevant audience at the right time: before the change can negatively impact live data.
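As a simplified sketch of what such an automated check can look like (the schemas below are hypothetical, and a real implementation would build them from the two git revisions rather than by hand):

```python
# Toy sketch of automated schema-change detection: compare the columns of
# the base branch against the proposed change and flag removals, which are
# potential breaking changes for downstream consumers.
from sqlalchemy import Column, Integer, MetaData, String, Table

def schema_columns(metadata):
    """Flatten SQLAlchemy MetaData into a set of (table, column) pairs."""
    return {(t.name, c.name) for t in metadata.tables.values() for c in t.columns}

# Hypothetical schemas for the base branch and the proposed change:
base = MetaData()
Table("users", base, Column("id", Integer), Column("signup_date", String))

head = MetaData()
Table("users", head, Column("id", Integer), Column("created_at", String))

removed = schema_columns(base) - schema_columns(head)
print(removed)  # {('users', 'signup_date')} -> flag before merge
```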
At Foundational, our goal is to bridge the cross-team, cross-tool fragmentation that exists in data. We think it should be easy and seamless for anyone making a change in one system to avoid causing issues and "surprises" for other teams. We believe that any developer introducing a code change should be able to fully understand the downstream impact of that change across all other systems and tools, and have an easy way to communicate about it through git and get the change validated and approved.
We've built this integration to help companies who use SQLAlchemy, typically with operational databases such as Postgres and MySQL, understand the downstream impact of changes and get complete lineage coverage across every part of the data stack: from the database through the data lake and warehouse, all the way to reporting and BI.
How does Foundational extract lineage from SQLAlchemy? Our code analysis engine reads the SQLAlchemy files and extracts the resulting database schema from the object definitions. This means that by analyzing the source code in Git for a customer we work with, and without any access to data, Foundational can extract the database schemas needed for automated data lineage.
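Foundational's engine is more involved than we can show here, but the toy snippet below illustrates the general idea: pulling table and column definitions out of SQLAlchemy source with Python's ast module alone, with no database connection and no data access (the parsed model is a made-up example):

```python
# Toy illustration (not Foundational's actual engine): statically extract a
# schema from SQLAlchemy model source code using only the Python AST.
import ast

SOURCE = '''
class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    email = Column(String(255), nullable=False)
'''

for node in ast.walk(ast.parse(SOURCE)):
    if isinstance(node, ast.ClassDef):
        table, columns = None, []
        for stmt in node.body:
            if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
                name = stmt.targets[0].id
                if name == "__tablename__" and isinstance(stmt.value, ast.Constant):
                    table = stmt.value.value
                elif (isinstance(stmt.value, ast.Call)
                      and getattr(stmt.value.func, "id", None) == "Column"):
                    columns.append(name)
        if table:
            print(f"table {table!r} has columns {columns}")
            # table 'users' has columns ['id', 'email']
```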
A developer making a schema change for a feature they need to implement can quickly validate that the change is safe to deploy, and see if, and whom, it might affect downstream, e.g., a critical dashboard used by the BI team. Organizations can also easily implement data contracts that validate these changes against specific organizational policies.
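For example, a change as small as the (hypothetical) rename below looks harmless in the application code, but silently breaks every downstream consumer that selects the old column name:

```python
# Hypothetical model change: renaming a column. The application keeps
# working, but any dashboard or warehouse job that selects
# accounts.signup_date will break once this deploys.
from sqlalchemy import Column, DateTime, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Account(Base):
    __tablename__ = "accounts"
    id = Column(Integer, primary_key=True)
    created_at = Column(DateTime)  # renamed from "signup_date"
```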
Alembic, often used in conjunction with SQLAlchemy, is a tool for managing database schema migrations: users of Alembic can easily upgrade or downgrade their database schemas across different versions.
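For context, a typical Alembic migration script looks roughly like this (a minimal sketch; the revision identifiers, table, and column are invented for the example):

```python
# Sketch of an Alembic migration script: upgrade() applies the schema
# change, downgrade() reverses it. Identifiers here are hypothetical.
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"
down_revision = None

def upgrade():
    op.add_column("users", sa.Column("signup_source", sa.String(50), nullable=True))

def downgrade():
    op.drop_column("users", "signup_source")
```

Applying the migration is then a matter of running `alembic upgrade head` from the command line.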
While Alembic is another place where database schemas are managed and created, we chose SQLAlchemy over Alembic as the right place to implement data lineage and data contracts:
Code changes, and schema changes specifically, are not necessarily complex. Sometimes it's a simple rename the developer wants to make to clean things up, or the removal of something old that is no longer used. Yet there is typically no check whatsoever that validates such a change against all the other systems in the stack.
We've built the technology at Foundational to capture all of these changes in the right place, which is where development teams work: git. That is also the right time to flag these problems: before they impact data, while the developer is actively working on the code change. This seems almost too trivial to argue for, but the reality today is that most data organizations have very few checks in place to prevent these issues from turning into data problems.
Integrating natively with git also means that adopting the technology is easier, since everyone is already using the same tools. Schedule time with us to see this in action on your own data stack.