Object-Relational Mapping (ORM) is a technique that lets engineers query and manipulate data in a database using an object-oriented paradigm. When people refer to an ORM, they usually mean a specific library that implements this technique, built for the programming language the development team is using, for example Python. The ORM library wraps all the code needed to manipulate the data, so instead of writing SQL you interact with objects in the same language you're already using.
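To make this concrete, here is a minimal, self-contained sketch of the paradigm, assuming SQLAlchemy 1.4+; the Customer model and the in-memory SQLite database are illustrative, not from any particular codebase:

```python
# Minimal sketch of the ORM paradigm: queries are expressed against Python
# objects, and the library generates the SQL. The Customer model and the
# in-memory SQLite database are illustrative assumptions.
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

engine = create_engine("sqlite://")  # in-memory database for the example
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Customer(name="Ada"))
    session.commit()
    # Instead of writing "SELECT * FROM customers WHERE name = 'Ada'",
    # we query through the mapped class:
    ada = session.execute(select(Customer).where(Customer.name == "Ada")).scalar_one()
    print(ada.id, ada.name)
```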
Understanding ORMs is of key importance for data teams since, in many cases, this is where breaking schema changes originate. One such ORM framework is SQLAlchemy. Built for Python, SQLAlchemy gives developers easy-to-use Python objects while the underlying data is stored in a relational database, for example Postgres. Developers define their objects in Python using the SQLAlchemy framework, and SQLAlchemy translates those objects into the corresponding database schemas and tables.
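As an illustration of that translation, the snippet below defines a hypothetical Order model and asks SQLAlchemy to render the Postgres DDL it would emit for it (a sketch, assuming SQLAlchemy 1.4+; the model is invented for the example):

```python
# Sketch: a SQLAlchemy model and the Postgres schema it translates to.
from sqlalchemy import Column, Integer, String
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import declarative_base
from sqlalchemy.schema import CreateTable

Base = declarative_base()

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    status = Column(String(32), nullable=False)
    total_cents = Column(Integer, nullable=False)

# Render the CREATE TABLE statement SQLAlchemy would emit for Postgres:
print(CreateTable(Order.__table__).compile(dialect=postgresql.dialect()))
# CREATE TABLE orders (
#     id SERIAL NOT NULL,
#     status VARCHAR(32) NOT NULL,
#     total_cents INTEGER NOT NULL,
#     PRIMARY KEY (id)
# )
```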
There are many reasons to use an ORM, but arguably the most important is consistency around data modeling: the modeling layer lives in a single place, which makes this critical piece far more maintainable and transparent. At the same time, and perhaps as a result, it is also the most common place where a simple schema change can do a lot of damage to downstream data systems.
When we ask the data practitioners we meet at Foundational about the common causes of data incidents and data quality issues in general, by far the most common answer we hear is upstream schema changes. "Upstream schema changes," or upstream changes more generally, do not always mean the same thing: for a BI person, it can be a change in the Snowflake tables maintained by data engineers. For the data engineering team, upstream can be a change in the data coming from Postgres, i.e., the operational database, or a change in a Salesforce object managed by a different team.
There is also a strong intuition for why this type of change is potentially the most devastating: the person making the change in their own project has no visibility into the downstream impact on other projects and parts of the tech stack. For engineering projects, software engineering teams sometimes address these cross-project issues by putting all projects in a single repository with a single build system, also known as a monorepo. But for the data stack this is simply not possible, since different areas of the stack are developed by different teams, typically across different organizations such as Data and Engineering. The result is that software engineers do their own testing in the software engineering repositories without knowing what is happening in the completely separate data engineering repositories. There is no single CI process, and cross-project dependencies are routinely ignored.
Of course, if engineers are aware that a schema change can break something downstream, they can communicate about it and find the relevant person to work with, but at a growing organization this is simply impractical and awfully time consuming. The way to streamline this is proper change management, where schema changes are detected automatically and communicated to the relevant audience at the right time: before the change can negatively impact live data.
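As a simplified sketch of what such an automated check can look like (the schemas below are hypothetical, and a real implementation would build them from the two git revisions rather than by hand):

```python
# Toy sketch of automated schema-change detection: compare the columns of
# the base branch against the proposed change and flag removals, which are
# potential breaking changes for downstream consumers.
from sqlalchemy import Column, Integer, MetaData, String, Table

def schema_columns(metadata):
    """Flatten SQLAlchemy MetaData into a set of (table, column) pairs."""
    return {(t.name, c.name) for t in metadata.tables.values() for c in t.columns}

# Hypothetical schemas for the base branch and the proposed change:
base = MetaData()
Table("users", base, Column("id", Integer), Column("signup_date", String))

head = MetaData()
Table("users", head, Column("id", Integer), Column("created_at", String))

removed = schema_columns(base) - schema_columns(head)
print(removed)  # {('users', 'signup_date')} -> flag before merge
```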
At Foundational, our goal is to bridge the cross-team, cross-tool fragmentation that exists in data. We think it should be easy and seamless for anyone making a change in one system to avoid causing issues and "surprises" for other teams. We believe that any developer introducing a code change should be able to fully understand the downstream impact of that change across all other systems and tools, and have an easy way to communicate about it through git and get the change validated and approved.
We've built this integration to help companies who use SQLAlchemy, typically with operational databases such as Postgres and MySQL, understand the downstream impact of changes and get complete lineage coverage across every part of the data stack: from the database through the data lake and warehouse, all the way to reporting and BI.
How does Foundational extract lineage from SQLAlchemy? Our code analysis engine reads the SQLAlchemy files and extracts the resulting database schema from the object definitions. This means that by analyzing the source code in Git for a customer we work with, and without any access to data, Foundational can extract the database schemas needed for automated data lineage.
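Foundational's engine is more involved than we can show here, but the toy snippet below illustrates the general idea: pulling table and column definitions out of SQLAlchemy source with Python's ast module alone, with no database connection and no data access (the parsed model is a made-up example):

```python
# Toy illustration (not Foundational's actual engine): statically extract a
# schema from SQLAlchemy model source code using only the Python AST.
import ast

SOURCE = '''
class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    email = Column(String(255), nullable=False)
'''

for node in ast.walk(ast.parse(SOURCE)):
    if isinstance(node, ast.ClassDef):
        table, columns = None, []
        for stmt in node.body:
            if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
                name = stmt.targets[0].id
                if name == "__tablename__" and isinstance(stmt.value, ast.Constant):
                    table = stmt.value.value
                elif (isinstance(stmt.value, ast.Call)
                      and getattr(stmt.value.func, "id", None) == "Column"):
                    columns.append(name)
        if table:
            print(f"table {table!r} has columns {columns}")
            # table 'users' has columns ['id', 'email']
```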
A developer making a schema change for a feature they need to implement can quickly validate that the change is safe to deploy, and see if, and whom, it might affect downstream, e.g., a critical dashboard used by the BI team. Organizations can also easily implement data contracts that validate these changes against specific organizational policies.
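For example, a change as small as the (hypothetical) rename below looks harmless in the application code, but silently breaks every downstream consumer that selects the old column name:

```python
# Hypothetical model change: renaming a column. The application keeps
# working, but any dashboard or warehouse job that selects
# accounts.signup_date will break once this deploys.
from sqlalchemy import Column, DateTime, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Account(Base):
    __tablename__ = "accounts"
    id = Column(Integer, primary_key=True)
    created_at = Column(DateTime)  # renamed from "signup_date"
```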
Alembic, often used in conjunction with SQLAlchemy, is a tool for managing database schema migrations: users of Alembic can easily upgrade or downgrade their database schemas across different versions.
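For context, a typical Alembic migration script looks roughly like this (a minimal sketch; the revision identifiers, table, and column are invented for the example):

```python
# Sketch of an Alembic migration script: upgrade() applies the schema
# change, downgrade() reverses it. Identifiers here are hypothetical.
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"
down_revision = None

def upgrade():
    op.add_column("users", sa.Column("signup_source", sa.String(50), nullable=True))

def downgrade():
    op.drop_column("users", "signup_source")
```

Applying the migration is then a matter of running `alembic upgrade head` from the command line.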
While Alembic is another place where database schemas are managed and created, we chose SQLAlchemy over Alembic as the right place to implement data lineage and data contracts:
Code changes, and schema changes specifically, are not necessarily complex. Sometimes it's a simple rename the developer wants to make to clean things up, or the removal of something old that is no longer used. Yet there is typically no check whatsoever that validates such a change against all the other systems in the stack.
We've built the technology at Foundational to capture all of these changes in the right place, which is where development teams work: git. That is also the right time to flag these problems: before they impact data, while the developer is actively working on the code change. This seems almost too trivial to argue for, but the reality today is that most data organizations have very few checks in place to prevent these issues from turning into data problems.
Integrating natively with git also means that adopting the technology is easier, since everyone is already using the same tools. Schedule time with us to see this in action on your own data stack.